PROCEEDINGS OF 
THE FIRST 
GLOBAL WORDNET 
CONFERENCE 





Central Institute of Indian Languages 
Mysore, India 








1st International
Global WordNet Conference 

January 21 — 25, 2002 


Proceedings 


Foreword by 

Prof. Udaya Narayana Singh 


Introduction by 

Dr. Piek Vossen 
Dr. Christiane Fellbaum 


Central Institute of Indian Languages 

Mysore, India 



Proceedings of 1st International
Global Wordnet Conference
January 21-25, 2002


First Published: January 2002
Pausha 1923


© Central Institute of Indian Languages, Mysore 2002 

This material may not be reproduced or transmitted, either in part or in full, in any form or by any means, electronic, 
or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission 
in writing from: 

Prof. Udaya Narayana Singh 
Director 

Central Institute of Indian Languages 
Manasagangotri, Mysore-570 006 



INDIA 


Phone: 0091/0821-515820 (Director)
PABX: 0091/0821-515558
Telex: 0846-268 CIIL IN
Grams: BHARATI
E-mail: udaya@ciil.stpmy.soft.net (Director), bhasha@sancharnet.in
Fax: 00-91/0821-515032
Website: http://www.ciil.org


For Information Contact: 

Dr. K.S. Rajyashree, Head, Publications & Public Relations 


Phone: 0091/0821-412021
E-mail: rajya@ciil.stpmy.soft.net



Price: Rs. 160.00 (US $18.00)


Published by Prof. Udaya Narayana Singh, Director, 

Central Institute of Indian Languages, Mysore and 
Printed by Mr. S.B. Biswas, Manager 
CIIL Printing Press, Manasagangotri, Mysore-570 006, India 
Cover art by Prof. Udaya Narayana Singh 







FOREWORD 


Words have their own space, which sometimes overlies ours, but such overlaps do not necessarily 
happen, or at least, do not always happen. Some of us assume that we, the human beings, are keen to 
understand everything in terms of binary opposites. There have also been claims that we fail to 
understand things that defy all kinds of categorization. Of course, from our everyday life and 
experience, we are aware that we feel very uneasy when a phenomenon refuses to fall into any known 
and neat pattern or is beyond our conventional classifications. But looking at words has taught us that 
the same element could simultaneously, or at different times, i.e. in different contexts, fall in several 
classes - grammatical, semantic, functional. 


In fact, each word kindles a kind of imagination in our mind - lights a thousand lamps, as it were. Each
word hides a story beneath the face it wears. Or, better to say - each one of these apparently innocent
formations bubbles with tales that it could tell us. When you wish to confront them looking for
those hidden senses, they show you certain lanes and abysses which lead to some other
words - all looking similar, at least in terms of their senses, but still differing somewhere in
meaning. When you attempt to probe into any one of the others in this set, you realize that each one of
these others is further linked to yet others in a chained manner. It is this network of chains that
seemed ever elusive and impossible to track down until George Miller came into the picture. But his
was still an English sky, and it took many monsoons before the rainbow which we see today emerged,
when one began to look at different kinds of skies, or rather, at many skies together - plotting one on
the other, which enabled the differences and similarities to appear clearly.


What does this sky gazing tell us? To many of us, who would not give up easily, this way of looking
at things reveals newer and newer sites of the human mind. In trying to understand how our mind works,
words thus provide a powerful tool. But we do not want this, whatever we are doing as a part of
wordnet activity, to begin with and end in the study of words. These studies should lead us to some
other area of knowledge - especially to that science which enhances our knowledge (or, ignorance) of
how the human mind works. If human languages are ‘species generate’, then looking at relations among
entities, rather than entities in themselves, will tell us that we have crossed the threshold of the world
of grammar and lexicon, and are now well-entrenched into the mysterious (and often, enigmatic)
universe of semantics.


The personal lesson that I have learnt is that when you are trying to understand - not merely words
which are a piece of earth (each because of their rootedness to a given land and culture), but what
gets reflected or mirrored through them - an entire time - open-ended and resilient with an
inexplicable suppleness - a vast expanse of space, or a set of legends, folklore and myths, nothing that we
study in all social sciences and humanities put together seems to go waste. We begin to discover newer
and newer connections. In this context, let me recount my recent experience in stumbling on a story
on the net written by an eleven year old, Laura Beeston from Winnipeg, Manitoba in Canada, on what
she thought as to the origin of sky. I was amazed to see the kind of ideas our children are capable of
having deep in their unconscious, which they could reveal if challenged. What makes human
creativity unique, or makes us qualify for epithets like ‘species specific’, is that we are ever open
to generating new meanings and relationships among things that we have always known to be in
certain specific ways.


Laura had understood, or at least, imagined words and their meanings differently. In other words, she 
was redefining many words here. Let me quote her story to exemplify what I mean. Reading this piece 
makes the point obvious. (Let us ignore the proper names she had chosen.) Here’s how it goes where 
the underscoring (of the novel ideas and expressions) is mine:

Before the stars, when the earth was long and flat, and the moon was cold and plain, there lived a spirit named Obweji. Obweji was very
powerful because he was the Spirit of the Sky. He owned the universe and the great flat earth. He had many servants on earth; they were all
afraid of him and did his bidding.

They were all happy when Obweji lifted and went up to his big dark sky again. 





On earth there was a beautiful maiden named Pateka. Her hair was the colour of a raven’s wing and her eyes sparkled like the fire. She was
kind and respectful to her people and she loved the sun.

One day Obweji came down to Pateka's village to choose a bride.

All the people were obligated to give forth their daughters and the families who didn’t were killed.

When Obweji saw Pateka he chose her right away because of her great beauty.

When Pateka was given to him she cried and cried but went with him for the sake of her people. She cried many nights after and the only thing
that pleased her was staring into the sun for many hours. Obweji was sorry but very mad at Pateka for being disrespectful to him. One night
Pateka told Obweji, ‘I am leaving you because you are cruel to my people!’ The words shocked Obweji and he became furious at her. He
grabbed her stone necklace from her neck. But Pateka was too quick and darted away from him. He held the broken necklace as she ran away
never to be seen again. Obweji was so overcome by anger that he threw all the beads in the sky, and there they stopped and shone like
diamonds in day and night. Obweji was ashamed and scared to face his people, he felt weak and thought about his lost love all the time. A few
days later Obweji started to cry.

He picked up his earth and rolled it very slowly in his big hands then he turned faster and faster until his palms hurt and his head ached. He
was so sad that he went to the moon and slept for 4 days.

Then he died of a broken heart.

The servants were overjoyed because they didn't have to work for him anymore. But they would be reminded of him when they looked at his
image on the moon.

And Pateka? She ran to the spirit of the sun and married him. They were very happy and had many children.

They named them, Venus, Mercury, Mars, Jupiter, Saturn, Uranus, Pluto and Neptune. They lived happily ever after.

And that is how the sky came to be.


The aestheticians would of course discover many more qualities in the story such as who the
protagonist was, how Obweji was an anti-hero, and Pateka was a truly liberated person, or even the
originality of this modern ‘Origin’-myth. But notice how the following six word-sets were
conceptualised differently, which had made all the difference to her creative piece: (i) earth, sky,
moon, sun, (ii) live, die, kill, be furious, (iii) love, respect, fear, (iv) go, go up, come down, (v) break,
dart, throw and (vi) sad, be sorry, be ashamed, be mad, joy, be overjoyed. Many of these do not
qualify to be synsets in the WordNet tradition, and yet are related to one another intimately enough to be
called wordsets of a kind. The entire story practically hinges on the re-interpretations of the above
relationships - some antonymic, some hyponymic or even hypernymic and others, holonymic.

I guess that when we are able to delve deeper and deeper into the universe of words -
thrown by our own culture to shine all day and night in our skies - and when the intricate interweaving of
concepts employed by everyone from the most sophisticated and careful to the least prepared but instant story-tellers
and poets of our times can be captured in formulaic fashion as an extension of wordnet
research activities, I shall have a greater satisfaction. The satisfaction will not be because the whole
meaning universe will be possible to describe in terms of formulae, but because it will probably take
us closer to understanding how the human mind works.

Until then, here we are, with a challenging set of conference papers - all carefully crafted and chosen 
with equal care by a double-blind method, and we hope to be able to bring out the fruits (Mind my 
Words!) of some of the exciting research activities going on in so many countries - to be read, 
discussed and debated over the week beginning January 21, 2002. We must appreciate the gesture of 
the international wordnet community for having chosen India and CIIL, Mysore as the venue of the 
first ever GWN meet. In the years to come, let the acronym draw more and more scholars currently in
pursuit of understanding cognition to come together on other platforms. As for now,
enjoy the wide variety of topics being discussed under the un-common noun, wordnet. 

Udaya Narayana Singh 





INTRODUCTION 


WordNets on a Global Scale 

When George Miller began an experiment designed to test a model of human semantic memory, he 
did not foresee the scope and the impact of the resulting database, WordNet. Today, we cannot tell 
whether the experiment proved the semantic network theory to be wrong or right. But we can say that 
the WordNet model has become a new kind of lexicography. Looking back at a 15 year-old 
experiment, some aspects of its development raise profound questions. Why were glosses added to the 
synsets if the words' meanings were supposed to be defined entirely by their relations to other words 
and synsets? Why are there distinct synsets that contain identical words? Why are there sense 
numbers? And so forth. 

Should we be concerned with these questions? Probably not. Questions will continue to be asked and 
nobody can tell what kinds of answers future developments will hold. What we do know now is that 
WordNet has spawned completely new fields of research and stimulated many others, with the result 
that some old questions have been answered and new ones have arisen. We feel strongly that, thanks
to WordNet, we now have a fantastic opportunity to actually test some fascinating aspects about 
language, culture and thought. And perhaps we should add to this list teaching, technology and even 
politics. 

WordNet has not only migrated across the oceans but beyond into subcultures and specialized
domains, and has become wordnet, a common noun. We now have numerous wordnets in different
languages; more are being built today. Even better, most of these wordnets are interconnected so that 
they can be compared, can inform each other, and can bring people of different backgrounds and 
cultures together. Sapir and Whorf could only dream about the kind of power we have when we 
merely push a few buttons. We can compare not only the lexicons of languages all over the world, but 
also a general vocabulary with those of sub-languages, genres, chat-groups, and age groups. Such 
information can teach us about differences in culture and language systems. We can compare 
experiments on classification and word sense disambiguation across languages and language types. 
We are able to use one language to process another. 

Communication via the web will be greatly helped by wordnets, and web communication in turn will 
aid the development of wordnets. The Web is a huge empirical and experimental body of naturally 
occurring linguistic material. Connecting language data from the web with wordnets in many 
languages will be a big challenge for the coming decade. On the one hand, it will bring context to the 
abstract concepts in wordnets and, on the other hand, it will assist the development of the semantic 
web in a linguistic sense, respecting differences in culture and lowering the threshold for billions of 
people in the world that do not speak English. 

The first International WordNet Conference provides the opportunity for reflection on the range and 
depth of WordNet's impact. The conference presents reports on the construction of new wordnets in a 
variety of languages and language families and contributions on the methodology for building 
wordnets. There are papers on sublanguages and domains and reports on the usage of wordnets for 
word sense disambiguation and for classification. Other contributions center on formal aspects of the 
network structure, ontological structure, and on fundamental aspects of lexical semantic relations. 
There are practical papers on WordNet databases and on the visualization and navigation of wordnets. 
Applications-oriented contributions tell about the use of wordnets for teaching, text mining, web 
navigation and text evaluation. 

This conference aims to bring together these many different interests and to generate new ideas and 
thoughts. We sincerely hope that it will further strengthen the globalisation of lexical and cultural 
individualisation and, not the least, will increase global respect for the diversity of human language. 

On behalf of the Global WordNet Association 


Piek Vossen and Christiane Fellbaum 



ACKNOWLEDGEMENT 


The proceedings of the 1st International Conference on Global WordNet were possible because of the co-operation
of all the authors and reviewers. I would like to express sincere thanks to all the authors
and members of the review committee for their prompt submissions within deadlines.

The involvement and constant guidance of Dr. Christiane Fellbaum and Dr. Piek Vossen were extremely
valuable not only in organizing the 1st International Conference but also in publishing this volume. I
express my gratitude and thanks to both of them.

The support, direction and help provided by Prof. Udaya Narayana Singh, the Director of the Institute,
have shaped this conference and the volume. I am grateful to him for all the encouragement as well as for
the foreword.

I would like to record my thanks to Ms. Poornima and Ms. Lakshmi for formatting all the papers and
preparing a press-ready manuscript in a short time.

The excellent conference site developed by Mr. B.P. Malthesh of Unicomp has helped to bring out this
volume in record time because of online submission of papers and reviews. My sincere thanks to
Mr. B.P. Malthesh.

I thank the staff of CIIL press for their timely co-operation in bringing out this volume. 

B.D. Jayaram 
Secretary, GWN 





Contents 


Foreword - Prof. Udaya Narayana Singh


Introduction - Dr. Piek Vossen and Prof. Christiane Fellbaum 

Acknowledgement - B.D. Jayaram 

Building WordNets 

ItalWordNet: A Large Semantic Database for the Automatic Treatment of the Italian 
Language 

Adriana Roventini, Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Rita 
Marinelli, Bernardo Magnini, Manuela Speranza. 

BALKANET: A Multilingual Semantic Network for the Balkan Languages 

Stamou Sofia, Oflazer Kemal, Pala Karel, Christodoulakis Dimitris, Cristea Dan, Tufis Dan,
Koeva Svetla, Totkov George, Dutoit Dominique, Grigoriadou Maria.

An Ontology and a Semantic Network for Danish Times Adverbs - based on the 
SIMPLE Lexicon Model 

Sanni Nimb 

Adjectives In WordNet-Type Thesaurus: Estonian Experience 

Heili Orav 

Estonian WordNet Benefits from Word Sense Disambiguation

Neeme Kahusk, Kadri Vider. 

Methodological Issues in the Building of the Basque WordNet: Quantitative and 
Qualitative Analysis 

Eneko Agirre, Olatz Ansa, Xabier Arregi, Jose Mari Arriola, Arantza Diaz de Ilarraza, Eli
Pociello, Kepa Sarasola, Larraitz Uria.

Expanding the EWN with Domain-Specific Terminology Using Common Lexical 
Resources: Vocabulary Completeness and Coverage Issues 

Stamou Sofia, Ntoulas Alexandros, Kyriakopoulou Maria, Christodoulakis Dimitris. 

Semi-Automatic Construction of Korean Noun Thesaurus by Utilizing Monolingual 
MRD and an Existing Thesaurus

Juho Lee, Koaunghi Un, Key-Sun Choi. 

A Tree-structure Solution for the Development of ChineseNet

Liu Yang, Yu Jiangsheng, Yu Shiwen. 

Experiences in building the Indo WordNet: A WordNet for Hindi 

Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Pushpak Bhattacharyya. 

Tamil Wordnet 

Devi Poongulhali, P. Kavitha Noel N, Preeda Lakshmi R, T.V Geetha, A. Manavazhahan.

Disambiguation and Semantic Annotation 

Word Sense Disambiguation Using Semantic Graph 

Narayanan Unny E, Pushpak Bhattacharyya. 

Validity of Noun Semantic Networks for Korean Word-Sense Disambiguation 

Yoo-Jin Moon, Kyungho Min. 

ItalWordNet in an Annotation Task: A Chance for Discussion 

Claudia Soria, Francesca Bertagna, Nicoletta Calzolari. 





Ontologies, Concepts, Top Levels 


Distinguishing Concepts and Instances in WordNet 

Enrique Alfonseca, Suresh Manandhar.

Cleaning-up WordNet's Top-Level 

Aldo Gangemi, Nicola Guarino, Alessandro Oltramari. 

Chinese Characters and Top Ontology in EuroWordNet 

Shun Ha Sylvia Wong, Karel Pala. 

Evolution of WordNet-Like Lexicon 

Yu Jiangsheng 

Assigning Domain Labels 

EUROTERM: Extending EWN using both the Expand and Merge Model 

Stamou Sofia, Ntoulas Alexandros, Hoppenbrouwers Jeroen, Saiz-Noeda Maximiliano, 
Christodoulakis Dimitris. 

Comparing Ontology-based and Corpus-based Domain Annotations in WordNet 

Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo, Alfio Gliozzo.

Induction of Classification from Lexicon Expansion: Assigning Domain Tags to 
WordNet Entries 

Echa Chang 

Interfaces

A Graphical Tool for Browsing, Searching, and Annotating WordNet 

Arthur Cater 

Adapting GermaNet for the Web 

Claudia Kunze, Lothar Lemnitzer. 

Visualizing WordNet Structure 

Jaap Kamps 

Structured Access to Scientific Information 

Caterina Caracciolo, Maarten de Rijke. 

VisDic - A New Tool for WordNet Editing 

Tomas Pavelek 

WordNet Web Navigation Interface: A Fast Interface to Navigate EuroWordNet 
Hierarchies 

Eneko Agirre, Olatz Ansa, Xabier Arregi, Kike Fernandez. 

Storing and Retrieving WordNet Database (and Other Structured Dictionaries) in XML 
Lexical Database Management System 

Pavel Smrz 

Sublanguages 

An Architecture for Engineering Sublanguage WordNets 

Kalyan Moy Gupta, David W. Aha, Elaine Marsh, Tucker Maney. 

Extending Synsets with Medical Terms 

Paul Buitelaar, Bogdan Sacaleanu. 




Characterizing the Definitions of Anatomical Concepts in WordNet and Specialized 
Sources 

Olivier Bodenreider, Anita Burgun. 





Applications 

EuroWordNet as a Resource for Learning Spanish Verbs 

Roser Morante, M. Antònia Martí.

The Wordnet as a Vocabulary Management Tool for Indexing Language 

Hemalata Iyer, B.A.Sharada. 

Bulgarian WordNet as a Source for (Psycho) Linguistic Studies 

Krassimira Petrova, Toma Nikolov. 

Lexicons in an Object-Oriented Grammatical Model For Universal Grammar-Based 
Machine Translation (UGBMT) 

Yukiko Sasaki Alam, Shahid Alam. 

Semantic Based Text Mining 

D.Manjula, Malliga, T.V Geetha. 

Tamil WordNet 

S Rajendran, S Arulmozi, B Kumara Shanmugam, S Baskaran, S Thiagarajan. 

Oriya Word Net 

S.Mohanty, N.B.Ray, R.C.B.Ray, P.K.Santi. 

Indigenous Knowledge Systems In The Global Wordnet: Focus on Car Nicobarese 

R. Elangaiyan 

Aligning WordNets / Crosslinguistic Work 

Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and 
HowNet 

Marine Carpuat, Grace Ngai, Pascale Fung, Kenneth W. Church. 

MultiWordNet: Developing an Aligned Multilingual Database 

Emanuele Pianta, Luisa Bentivogli, Christian Girardi. 

An Unsupervised Method for General Named Entity Recognition and Automated 
Concept Discovery 

Enrique Alfonseca, Suresh Manandhar.

Lexical Semantics 

Automated Discovery of Telic Relations for WordNet 

Marco De Boni, Suresh Manandhar. 

Nouns in WordNet and HowNet: An Analysis and Comparison of Semantic Relations 

Ping-Wai Wong, Pascale Fung. 

Integrating Selectional Preferences in WordNet 

Eneko Agirre, David Martinez. 

Words with Attitude 

Jaap Kamps, Maarten Marx. 

Metaphoric Expressions: An Analysis of Data from a Corpus and the ItalWordNet
Database

Antonietta Alonge, Margherita Castelli. 

What Does It Mean To Be a Shelf? Semantic Bleaching and WordNet 

Sandiway Fong 

Cross-Linguistic Discovery of Semantic Regularity 

Wim Peters, Louise Guthrie, Yorick Wilks. 








ItalWordNet: A Large Semantic Database 
for the Automatic Treatment of the Italian Language 


Adriana Roventini 
Antonietta Alonge 
Francesca Bertagna 
Nicoletta Calzolari 
Rita Marinelli 
Bernardo Magnini 
Manuela Speranza 
Antonio Zampolli 


ABSTRACT 

This paper describes the main characteristics of the ItalWordNet semantic database, built within the 
SI-TAL Italian National Project. The database was created by extending the Italian wordnet 
developed within the EuroWordNet project by adding i) adjectives, adverbs and proper nouns (not 
dealt with within EuroWordNet); ii) a terminological subset related to the economic-financial domain. 
The relevant changes involved by these extensions both in the linguistic model and in the data 
structure are illustrated. 

1. Introduction 

SI-TAL (Integrated System for the Automatic Treatment of Language) is a National Project devoted
to the creation of large linguistic resources and software tools for processing written and spoken
Italian. Among the resources developed in SI-TAL¹, ItalWordNet (IWN) has been built
as the reference semantic database, by extending the Italian wordnet developed within the
EuroWordNet (EWN) project. This was achieved by inserting primarily adjectives, adverbs and a
subset of proper nouns, but also nouns and verbs which had not been taken into consideration in 
EWN. Although we basically used the EWN model of lexical-semantic relations to build IWN, we 
identified additional relations, mainly to be used to encode data on adjectives (inserted in EWN only 
as targets of relations from nouns and verbs). Moreover, we added a terminological subset, related to 
economy, providing the possibility to access either only the generic lexicon or only the specialized 
set, or even both subsets at the same time. We describe the overall architecture of the IWN database, 
the ‘new’ semantic relations encoded and the characteristics of the terminological subset. 

2. Overall Architecture of the IWN Database 

The IWN database is made up of the following components: 

i) a generic wordnet containing 45,290 lemmas corresponding to 49,127 synsets; 

ii) a (generic) Interlingual-Index (ILI), used in EWN to link wordnets of different languages. In
IWN we also linked the synsets encoded in the generic wordnet to the ILI, to make the resource usable
in multilingual applications; 

iii) a terminological wordnet, linked to the generic wordnet, containing 5,130 lemmas corresponding 
to 4,687 synsets belonging to the economic-financial domain; 

iv) a terminological ILI, containing synsets related to the economic-financial domain; 


¹ Besides IWN, the following tools and resources have been developed within the project: a treebank with three-level
syntactic and semantic annotation, a system for integrating NL processors in applications for managing grammatical
resources, an annotated dialogue corpus for applications of advanced vocal interfaces, and software and tools for advanced vocal
interfaces.



v) the Top Ontology (TO), mostly inherited from EWN but partially modified in IWN to account for 
adjectives and adverbs. Via the ILIs, all the concepts in the generic and specific wordnets are 
directly or indirectly linked to the TO²;

vi) the Domain Ontology (DO), containing a set of domain labels. Via the ILIs, all the concepts in the 
specific wordnet are directly or indirectly linked to the DO. 

All these components and their reciprocal links are shown in figure 1. 
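
The component list above can be read as a small data model. The following Python sketch is only an illustration under stated assumptions: the class names, field names, the `collect` helper and every sample entry (`cane 1`, `animale 1`, `obbligazione 1`, the ILI identifiers, the feature and label strings) are hypothetical and do not come from the IWN database itself. It shows how a synset might reach Top Ontology features or domain labels either directly through its own ILI link or indirectly through its hyperonyms.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A synset of the generic or terminological wordnet (hypothetical model)."""
    synset_id: str
    lemmas: list
    hypernyms: list = field(default_factory=list)  # ids of hyperonym synsets
    ili_links: list = field(default_factory=list)  # equivalence links to ILI records


@dataclass
class IliRecord:
    """An ILI record carrying Top Ontology features and/or domain labels."""
    ili_id: str
    top_features: set = field(default_factory=set)
    domain_labels: set = field(default_factory=set)


def collect(synset_id, wordnet, ili, attribute):
    """Gather `attribute` from the synset's ILI links and from all its hyperonyms."""
    found, seen, stack = set(), set(), [synset_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the illustrative data
        seen.add(current)
        synset = wordnet[current]
        for ili_id in synset.ili_links:
            found |= getattr(ili[ili_id], attribute)
        stack.extend(synset.hypernyms)
    return found


# Invented sample data, for illustration only.
ili = {
    "ili-animal": IliRecord("ili-animal", top_features={"Animal", "Living"}),
    "ili-bond": IliRecord("ili-bond", domain_labels={"finance"}),
}
wordnet = {
    "animale 1": Synset("animale 1", ["animale"], ili_links=["ili-animal"]),
    "cane 1": Synset("cane 1", ["cane"], hypernyms=["animale 1"]),
    "obbligazione 1": Synset("obbligazione 1", ["obbligazione"], ili_links=["ili-bond"]),
}

print(sorted(collect("cane 1", wordnet, ili, "top_features")))
```

In this sketch `cane 1` has no ILI link of its own and obtains its features only through `animale 1`, which mirrors the "directly or indirectly linked" wording of items v) and vi).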



Figure 1: The architecture of the IWN database

3. The IWN Linguistic Model

The basic notion around which the IWN database is built is the same around which both WordNet 
(WN) and EWN are built, i.e. that of a synset. Various semantic relations are encoded between
synsets; however, the network built is mainly based on the hyponymy (or IS-A) relation.

As in EWN, in IWN the networks are not separated on the basis of their parts of speech (PoSs), but a 
distinction is drawn among the semantic orders of the entities to which word meanings refer (Lyons 
1977): 1st order entities (concrete nouns), 2nd order entities (verbs, adjectives or nouns indicating
properties, states, processes or events), and 3rd order entities (abstract nouns indicating propositions
independent of time and space). Thus, cross-PoS relations between words referring to similar concepts 
and belonging to the same semantic order are encoded. In addition, we also use other cross-PoS 
relations which emphasize the language-specific lexicalisation patterns of semantic components. IWN 
has also inherited from EWN the distinction between language-internal relations and equivalence 
relations. The latter are, in principle, similar to the former, but applied between synsets of the
language-specific wordnet and the ILI. 

3.1. Language-Internal Lexical-Semantic Relations and Features 

IWN has inherited the EWN language-internal relations and features (Alonge et al., 1998) with some 
minor changes. However, while in WN all major PoSs are represented, in EWN only data for verbs 
and nouns were encoded. In order to encode information on adjectives and adverbs we have thus 
enriched our set of relations by taking into consideration: i) how these categories are treated both in 
WN and in other theoretical works; ii) the EAGLES recommendations on semantic encoding 
(Sanfilippo et al., 1999); iii) the data available in our sources; iv) possible use of data encoded in 
computational applications. 

Since we presuppose the reader’s acquaintance with the EWN system of semantic relations and 
features of relations used, we only provide a description of the ‘new’ language-internal relations 
encoded in IWN, showing similarities and differences with either the WN or the EWN system.


² This was performed by linking all the most important top nodes of hyponymic hierarchies (the Base Concepts) to one or
more TO features. Such features are then inherited via hyponymy by all the concepts within the wordnet.




3.1.1. INTERNAL RELATIONS 


The hyperonymy/hyponymy relation is the most important relation encoded for noun and verb synsets 
both in WN and EWN. This is due to the possibility it provides to identify classes of words for which 
it is possible to draw generalizations and inferences. 

While EWN contains detailed information only on nouns and verbs and therefore there are no 
hyponymy relations between adjective or adverb synsets, the lack of such a relation for adjectives and 
adverbs in WN is mainly due to theoretical reasons. In WN adjectives are divided into two major 
classes: descriptive adjectives and relational adjectives. A descriptive adjective is “one that ascribes a 
value of an attribute to a noun” (Fellbaum et al., 1993: 27). Typically, in this group we find adjectives
that designate the physical dimension of an object, its weight, abstract values, etc. Relational 
adjectives, on the other hand, mean something like “relating/pertaining to, associated with”, and 
usually have a morphologically strong link with a noun. Typical examples are musical, atomic,
chemical. The organization of descriptive adjectives in WN can “be visualized in terms of barbell-like 
structures, with a direct antonym in the centre of each disk surrounded by its semantically similar 
adjectives (which constitute the indirect antonyms of the adjectives in the opposed disk)” (Fellbaum, 
1998: 212). The main relation encoded for these adjective synsets is antonymy, claimed to be the most 
prominent relation, both from a psycholinguistic and a more strictly lexical-semantic point of view, in 
the definition of descriptive adjectives. Hyponymy is substituted by a ‘similarity’ relation. 

The semantics of relational adjectives cannot be described by using these relations. Indeed, they only 
point to the noun to which they pertain (e.g. atomic is linked to atom). 
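
The barbell-like organization described above can be sketched in a few lines of Python. This is a minimal illustration, not the WN data format: the dictionaries, the function names and the sample cluster (wet/dry with a few similar adjectives) are assumptions chosen for the example. Each disk has a head adjective with a direct antonym, and the members of the opposed disk, other than an adjective's own direct antonym, count as its indirect antonyms.

```python
# Head adjective -> semantically similar adjectives around it (illustrative data).
CLUSTERS = {
    "wet": {"damp", "moist", "soggy"},
    "dry": {"arid", "parched"},
}
# Direct antonymy holds between the two heads at the centre of each disk.
DIRECT_ANTONYM = {"wet": "dry", "dry": "wet"}


def head_of(adjective):
    """Return the head of the disk that contains `adjective`."""
    for head, similar in CLUSTERS.items():
        if adjective == head or adjective in similar:
            return head
    raise KeyError(adjective)


def indirect_antonyms(adjective):
    """Members of the opposed disk, minus the adjective's own direct antonym."""
    opposed_head = DIRECT_ANTONYM[head_of(adjective)]
    opposed_disk = {opposed_head} | CLUSTERS[opposed_head]
    if adjective in DIRECT_ANTONYM:
        opposed_disk.discard(DIRECT_ANTONYM[adjective])
    return opposed_disk


print(sorted(indirect_antonyms("wet")))
print(sorted(indirect_antonyms("damp")))
```

Relational adjectives such as atomic would not appear in such a structure at all; under this reading they would only carry a pointer to their noun.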

Although we consider antonymy as the basic relation to define the semantics of most descriptive 
adjectives, we reconsidered the possibility of encoding hyponymy for this category. By analysing data 
coming from machine-readable dictionaries we found subsets of adjectives which have a genus + 
differentia definition, like nouns or verbs. These adjectives were organised into classes sharing a 
superordinate. This is the case, e.g., of adjectives indicating a ‘containing’ property (acquoso -
watery; alcalino - alkaline), or a ‘suitable-for’ property (difensivo - defensive; educativo -
educational), etc. By encoding a hyponymy relation for them we obtained classes for which it is 
possible to make various inferences. For instance, it is possible to infer their semantic preferences: 
e.g., all the hyponyms of the {contenente 1} synset (‘containing’ in its literal sense) will occur as 
attributes of concrete nouns while all the hyponyms of {pieno 3} (‘full’ in its metaphoric sense), as
for example orgoglioso (proud), rabbioso (furious, angry), will occur as attributes of either concrete 
or abstract nouns; adjectives found in the taxonomy of {malato 1, sofferente 1, infermo 1, ammalato 
1} (suffering from an illness) will never be predicated of nouns referring to objects, etc. Furthermore, 
it is possible to infer information on syntactic characteristics of adjectives found in the same 
taxonomy: e.g., in Italian, the hyponyms of {atto 1, adatto 1, congruo 1} (suitable for) are always 
found in predicative position (and do not accept any complements); the hyponyms of {privo 1, 
sprovvisto 1} (lacking) may occur both in attributive and predicative position (and may take certain
prepositional complements), etc. Hyponymy was also encoded across PoSs: e.g., entrata (entering) is 
linked to andare (to go) by means of a XPOS_HAS_HYPERONYM relation. 
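
The inference described above can be illustrated with a minimal sketch. It assumes a toy child-to-hypernym table and invented feature labels ("concrete", "abstract"); neither the data layout nor the function name comes from the IWN database itself.

```python
# Toy fragment of an adjective taxonomy. Synset names follow the text;
# the feature labels ("concrete", "abstract") are illustrative assumptions.
HYPERNYM = {
    "acquoso": "contenente 1",
    "alcalino": "contenente 1",
    "orgoglioso": "pieno 3",   # 'full' in its metaphoric sense
    "rabbioso": "pieno 3",
}

# The semantic preference is stated once, on the superordinate synset.
NOUN_PREFERENCE = {
    "contenente 1": {"concrete"},
    "pieno 3": {"concrete", "abstract"},
}


def noun_preference(adjective):
    """Walk up the hyponymy chain until a synset with a stated preference is found."""
    node = adjective
    while node is not None:
        if node in NOUN_PREFERENCE:
            return NOUN_PREFERENCE[node]
        node = HYPERNYM.get(node)
    return set()  # no preference recorded anywhere on the chain
```

Stating the preference only on the superordinate keeps the hyponyms free of redundant annotation, which is the practical benefit of encoding hyponymy for these adjective classes.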

In EWN two antonymy relations are encoded, i.e. ANTONYMY, expressing meaning opposition 
between variants, and NEAR_ANTONYMY, expressing synset oppositions. In IWN we assumed that, 
since a synset contains different expressions for the same concept, it should not be possible to find an 
antonym of one of these expressions which is not an antonym of the others. Thus, we only encoded an 
ANTONYMY relation between synsets. However, besides the underspecified antonymy relation we 
added the possibility to encode more specific sub-relations. Following theoretical works (Lyons, 
1977; Cruse, 1986), we further distinguished between COMPL_ANTONYMY and GRAD_ANTONYMY . 
The former relation links synsets referring to opposing (complementary) properties/concepts: when 

3 A similar distinction is also made within the SIMPLE EC project (LE-iI346), whose goal was adding semantic information 
to the set of harmonized lexicons built within the PAROLE project for twelve European languages. 





one holds the other is excluded ( alive/dead ). The latter relation is used for those antonym pairs which 
refer to gradable properties ( long/short ). In case it is not clear whether two opposing synsets refer to 
complementary or gradable concepts, we can still use the underspecified ANTONYMY relation. This 
information can be useful for computational applications since word pairs presenting one of the two 
kinds of opposition may occur in different contexts (Cruse, 1986). Antonymy was also encoded across 
PoSs, by means of the XPOS_ANTONYMY relation: e.g., arrivo (arrival) is an XPOS_ANTONYM of partire 
(to leave). 
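
As a rough illustration of how the underspecified relation acts as a fallback, the following sketch picks a relation label from what is known about a pair of opposing synsets. The relation names are those given in the text; the function and its parameter are hypothetical.

```python
from enum import Enum


class Opposition(Enum):
    ANTONYMY = "ANTONYMY"              # underspecified
    COMPL_ANTONYMY = "COMPL_ANTONYMY"  # complementary: alive/dead
    GRAD_ANTONYMY = "GRAD_ANTONYMY"    # gradable: long/short


def opposition_label(gradable=None):
    """Pick the most specific relation available.

    gradable is True for gradable pairs, False for complementary pairs,
    and None when the lexicographer cannot decide.
    """
    if gradable is None:
        return Opposition.ANTONYMY
    return Opposition.GRAD_ANTONYMY if gradable else Opposition.COMPL_ANTONYMY
```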

A relation used in WN links an adjective to the noun of which it expresses a ‘value’. For instance, tall 
expresses a value of stature. This relation is also encoded in IWN, since it may be useful both to 
distinguish between adjective senses and to point out the adjective semantic preferences. 


The EWN DERIVATION relation is used in IWN to encode derivation links when no other semantic 
relation is available and it connects variants belonging to different PoSs: grande (great) DERIVATION 
grandemente (greatly) 4 . 

The BELONGS_TO_CLASS/HAS_INSTANCE relation is used to link proper nouns to the class of common 
nouns to which they belong: Roma BELONGS_TO_CLASS città (city); Arno BELONGS_TO_CLASS fiume 
(river). 

A LIABLE_TO/HAS_LIABILITY relation has been added in IWN to encode information on a large group 
of deverbal adjectives expressing the possibility of an eventuality occurring: giudicabile (judgeable) 
LIABLE_TO giudicare (to judge) (see Roventini et al., forthcoming, for a more detailed description of 
internal relations in IWN). 

3.2. Equivalence Relations to the ILI 

The equivalence relations between the IWN synsets and the ILI are defined similarly to internal 
relations. Thus, for instance, SYNONYMY and EQ_SYNONYMY can be defined in a similar way, the 
only difference being that the latter holds between a synset in the Italian wordnet and a synset in the 
ILI: e.g., an EQ_SYNONYMY relation is encoded between {acrimonia 1, amarezza 3} and {acrimony, 
bitterness, acerbity, jaundice}. Moreover, to encode equivalence relations we only use underspecified 
relations. 

With respect to the relations encoded in EWN we added further equivalence relations, like 
EQ_XPOS_NEAR_SYNONYMY and EQ_BELONGS_TO_CLASS. The former has been added to map more 
properly to the ILI those Italian synsets which are only expressed in English by means of a 
different PoS, the latter to map proper nouns which have no equivalent(s) in the ILI. 

We tried to encode at least an EQ_SYNONYM or EQ_NEAR_SYNONYM relation for each Italian synset 
in the wordnet. When we did not find satisfying mappings, we encoded an EQ_HAS_HYPERONYM 
relation plus, when possible, any other equivalent relation which might help to describe our synset 
meaning more precisely. 
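
The mapping policy just described can be sketched as a simple fallback. This is only an illustration under assumptions: the function, its parameters and the way ILI records are passed in are hypothetical, and only the relation names are taken from the text.

```python
def equivalence_links(exact=None, near=None, hypernym=None, other=()):
    """Return (relation, ILI record) pairs for one Italian synset.

    exact, near and hypernym are ILI record identifiers or None;
    other is an iterable of extra (relation, ILI record) pairs.
    """
    if exact is not None:
        return [("EQ_SYNONYM", exact)]
    if near is not None:
        return [("EQ_NEAR_SYNONYM", near)]
    # No satisfying synonym mapping: fall back to a hypernym link,
    # plus any other relation that helps to pin down the meaning.
    links = []
    if hypernym is not None:
        links.append(("EQ_HAS_HYPERONYM", hypernym))
    links.extend(other)
    return links
```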

3.3. The IWN Top Ontology 

The TO is a hierarchy of language-independent concepts, reflecting fundamental semantic 
distinctions, built within EWN and partially modified in IWN to account for adjectives and adverbs 
(not dealt with in EWN). Via the ILI, all the concepts in the generic and specific wordnets are directly 
or indirectly linked to the TO. 


4 Actually, this relation has been primarily used to create the subset of adverbs derived from adjectives. 



The IWN Top Concepts hierarchy for 2nd order entities is shown below: 

2nd Order Entity 
    Situation Component 
        Cause 
        Communication 
        Condition 
        Existence 
        Experience 
        Location 
        Manner 
        Mental 
        Modal 
        Physical 
            Material 
            Physiological 
        Possession 
        Purpose 
        Quantity 
        Social 
        Time 
        Intensity 
        Property 
            Attribute 
            Functional 
        Relation 
    Situation Type 
        Dynamic 
            BoundedEvent 
            UnboundedEvent 
        Static 

Table 1: The IWN 2nd Order Top Concepts 

In order to be able to draw generalizations on adjective meanings by using the TO, we partially 
modified the EWN scheme (modifications are indicated by bold characters). First of all, we moved 
the PROPERTY and RELATION nodes (which in EWN are found under the node STATIC) under the 
SITUATION COMPONENT node. This was done for two interconnected reasons: firstly, because this 
distinction is not directly linked to Aktionsart (lexical aspect), while the distinctions under SITUATION 
TYPE are Aktionsart distinctions. Secondly, adjectives may refer to PROPERTIES or RELATIONS, but 
they may be either stative or not (e.g. Lakoff, 1966; Quirk et al ., 1985; Peters et al., 1999). Thus, in 
our system it had to be possible to specify that an adjective expresses a PROPERTY while being 
DYNAMIC. In any case, since many adjectives may have both a DYNAMIC sense and a STATIC one, we 
have also the possibility to underspecify this information by linking adjectives directly to the 
SITUATION TYPE node. 

Adjectives may indicate many different types of properties: temporal (passeggiata mattutina - 
morning walk), psychological (canzone triste - sad song), social (uomo ricco - rich man), physical 
(superficie legnosa - wooden surface), etc. In the EWN TO there are already nodes which may be 
used to represent these distinctions (TIME, MENTAL, SOCIAL, PHYSICAL, QUANTITY) but we needed to 
better specify or also add some features. For example, we have added, under the already present node 
PHYSICAL, the node MATERIAL, to represent, among others, some Italian adjectives which indicate the 
property of containing a certain material. Moreover, we added the node PHYSIOLOGICAL (to classify 
adjectives corresponding to tired, hungry, sick, etc.) under PHYSICAL. For adjectives denoting an 
intensity, we then added the node INTENSITY directly under the SITUATION COMPONENT node. 

One of the main problems we had was that no Top Concept in the EWN TO could be used to classify 
the reference-modifying adjectives (e.g., former, actual). These do not indicate a property of the 
referent of the noun they modify. So, aiming at showing the distinction between referent-modifiers 
and reference-modifiers, we created two new Top Concepts under the node PROPERTY: ATTRIBUTE 






and FUNCTIONAL, where the latter can be used for reference-modifying adjectives (according to the 
definition provided by Chierchia and McConnell-Ginet, 1990 for the category referred to by these 
adjectives: “a function from properties to properties”). 

Like all descriptive adjectives, also the reference-modifiers classified under the node FUNCTIONAL can 
be linked to other SITUATION COMPONENTS. Functional adjectives for which the temporal aspect 
prevails (ex - former, presente - present) were classified under the node TIME; adjectives referring to 
some epistemological property (potenziale - potential, necessario - necessary) were linked to 
MODAL 5; etc. A particular case of functional adjectives are the ‘argumental’ ones. They introduce a 
comparison between different entities (e.g., simile - similar, diverso - different, etc.). A comparison 
presupposes a relation between different entities so these adjectives can be linked to both PROPERTY 
and RELATION. Since in the EWN TO these two Top Concepts were two different kinds of SITUATION 
TYPE, they were mutually exclusive; now, in the IWN revised TO they can be conjoined. 
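
To make the effect of the revision concrete, the sketch below encodes part of the revised hierarchy as a child-to-parent table and shows that PROPERTY no longer sits under STATIC. The table covers only the nodes discussed in this section, and the label sets attached to the two adjectives are illustrative assumptions based on the examples in the text.

```python
# Part of the revised IWN Top Ontology as a child -> parent table.
PARENT = {
    "Situation Component": "2nd Order Entity",
    "Situation Type": "2nd Order Entity",
    "Modal": "Situation Component",
    "Physical": "Situation Component",
    "Time": "Situation Component",
    "Intensity": "Situation Component",
    "Property": "Situation Component",
    "Relation": "Situation Component",
    "Material": "Physical",
    "Physiological": "Physical",
    "Attribute": "Property",
    "Functional": "Property",
    "Dynamic": "Situation Type",
    "Static": "Situation Type",
    "BoundedEvent": "Dynamic",
    "UnboundedEvent": "Dynamic",
}


def path_to_root(concept):
    """Return the chain from a Top Concept up to the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_under(concept, ancestor):
    """True if ancestor dominates concept in the hierarchy."""
    return ancestor in path_to_root(concept)[1:]


# Hypothetical label sets for two adjectives mentioned in the text.
LABELS = {
    "ex": ["Functional", "Time"],
    "simile": ["Functional", "Relation"],
}
```

Because PROPERTY and RELATION are now both daughters of SITUATION COMPONENT, a label set such as the one for simile combines them without conflicting with any SITUATION TYPE choice.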

3.4. The Domain Ontology 

Domain concepts within a Domain Ontology (DO) group together words relevant to a specific 
domain. The best approximation of domain concepts are the field labels used in dictionaries (e.g. 
Medicine, Architecture), though their use is restricted to words used in specific terminological 
domains. In the Princeton WN, too, domain concepts seem to be used occasionally and without a 
consistent design. In the EWN model a DO was foreseen, although only a small part of it was actually 
realized (namely, the domain of computer terminology). 

Information brought by the DO is complementary to what is already in WN. First of all, a domain 
concept may include synsets of different syntactic categories: for instance MEDICINE groups together 
senses from nouns, such as doctor#1 and hospital#1, and from verbs such as operate#7. Secondly, a 
domain concept may also contain senses from different WN sub-hierarchies (i.e. deriving from 
different “unique beginners” or from different “lexicographer files”). For example, the SPORT concept 
contains senses such as athlete#1, deriving from life_form#1, game_equipment#1 from 
physical_object#1, sport#1 from act#2, and playing_field#1 from location#1. 
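
The two points can be checked on the examples given above with a small sketch. The tuple layout and the function name are assumptions made for illustration; the senses, domains and unique beginners are those cited in the text.

```python
from collections import defaultdict

# (sense, part of speech, domain concept), as cited in the text.
SENSES = [
    ("doctor#1", "noun", "MEDICINE"),
    ("hospital#1", "noun", "MEDICINE"),
    ("operate#7", "verb", "MEDICINE"),
    ("athlete#1", "noun", "SPORT"),
    ("game_equipment#1", "noun", "SPORT"),
    ("sport#1", "noun", "SPORT"),
    ("playing_field#1", "noun", "SPORT"),
]

# Unique beginners, given in the text only for the SPORT senses.
UNIQUE_BEGINNER = {
    "athlete#1": "life_form#1",
    "game_equipment#1": "physical_object#1",
    "sport#1": "act#2",
    "playing_field#1": "location#1",
}


def summarize(senses):
    """Collect, per domain, the parts of speech and sub-hierarchies it spans."""
    pos_by_domain = defaultdict(set)
    roots_by_domain = defaultdict(set)
    for sense, pos, domain in senses:
        pos_by_domain[domain].add(pos)
        if sense in UNIQUE_BEGINNER:
            roots_by_domain[domain].add(UNIQUE_BEGINNER[sense])
    return dict(pos_by_domain), dict(roots_by_domain)


pos_by_domain, roots_by_domain = summarize(SENSES)
```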

There are two main steps in providing a DO for a lexical resource such as IWN: first, both the set of 
domain concepts and their organization have to be defined; second, domain concepts need to be 
connected to proper ILI synsets. 

As far as the DO definition is concerned, for the IWN database a DO previously developed at IRST 
(Magnini and Cavaglia, 2000), was adopted. The DO includes about 250 domain concepts organized 
in a hierarchy, where each level is made up of concepts of the same degree of specificity: for example, 
the second level includes concepts such as BOTANY, LINGUISTICS, HISTORY, SPORT and RELIGION, 
while at the third level we can find specializations such as American_history, Grammar, 
Phonetics, Tennis. 

The fragment of the DO relevant to the economic domain includes ten concepts, organized in the 
hierarchy reported below. These concepts correspond to the most frequent labels used by dictionaries 
to indicate a specialized use for a term. 


Top-Concept 

ECONOMY 
-> BOOK-KEEPING 
EXCHANGE 
TAX 
-> MONEY 
-> ENTERPRISE 
BANKING 
-> INSURANCE 
-> COMMERCE 
-> TRANSPORT 


5 Since this node is used for situations involving the possibility or likelihood of other situations. 






Once a DO is defined, it is necessary to link domain concepts to appropriate ILI synsets; then, through 
equivalence relations, the information is acquired by monolingual synsets. For the IWN database 
these connections have been established only with respect to the ILI of the terminological wordnet. 

4. Building the Terminological WordNet 

A specialized wordnet for the economic and financial domain (ECOWN) has been realized as part of 
the SI-TAL project. Although ECOWN adopts the same linguistic model as the generic database, 
there are two remarkable differences between the two resources: i) the ECOWN ILI consists of 
English synsets belonging to WN 1.6, instead of WN 1.5; ii) ECOWN synsets are linked to a small, 
domain oriented DO. 

ECOWN is organized around the same general principles as the generic resource. In order to realize 
it, a domain expert first designed the ECOWN top level, which contains the most relevant concepts 
of the domain. A number of basic terms were then extracted from various information sources on the 
basis of their frequency and significance in the economic and financial domain. This phase produced a 
list containing about 100 basic terms, each of which was considered as a possible root for an ECOWN 
sub-hierarchy. For each basic term, hyponym terms sharing the same “head” were searched for, and each 
group of terms was structured using the hyponymy relation. In this way about 2,500 terms were 
collected and structured. 

Through a consultation of the available information sources, a number of synonymic variants 
(including some acronyms) were inserted and terms detected from the treebank corpus were added, 
for a total of about one thousand new terms. Some selected sub-domains were explored in more detail 
in order to add more specialized terms, thus reaching a total of about 5,000 terms distributed in about 
4,700 synsets. 

By consulting a bilingual electronic dictionary partly automatically, partly manually, all the English 
equivalents of the Italian terms were identified and, finally, each synset was annotated with at least 
one concept of the DO. 

5. Integrating the Generic and the Terminological WordNet 

In this section we present a plug-in approach that has been conceived to enable an integrated access to 
the two resources. 

The issue of ontology merging has received increasing attention in recent years (see Hovy, 1998) but 
it should be noted that the plug-in approach, described in more detail in Magnini and Speranza (2001), 
is quite different from the general issue. In fact, our scenario is simplified because one of the two 
resources involved in the integration process is generic and the other is specialized. IWN contains 
generic knowledge with no domain specific coverage and with many relations, both lexical and 
related to linguistic aspects, while ECOWN focuses on the economic domain, providing 
sub-hierarchies of highly specialized synsets with a limited use of lexical and linguistic relations. 

First, we investigate a number of correspondences between the two resources; then we illustrate 
the main features of the plug-in approach and describe the actual procedures involved in the 
integration process. 

5.1. Correspondences between IWN and ECOWN 

ECOWN and IWN consist of quite different hierarchies, nevertheless it is possible to find specific 
correspondences between the two databases as far as single synsets are concerned. We have identified 
three kinds of correspondences between ECOWN and IWN synsets: 





1. Overlapping: It occurs when two synsets, the one belonging to IWN and the other to ECOWN, 
share the same meaning. Complete overlapping can generally be found when dealing with 
concepts which are very common although belonging to the specific domain of economy. 

2. Over-differentiation: Over-differentiation of IWN occurs when one single ECOWN synset 
corresponds to two or more IWN synsets, as a consequence of the different attention domain 
experts and lexicographers pay to linguistic phenomena. As an example, regular polysemy (see 
Pustejovsky, 1995) is generally taken into consideration by lexicographers when defining words, 
whereas domain experts may completely ignore it. Similarly, one single IWN synset may 
correspond to two or more ECOWN synsets as a consequence of subtle distinctions made by 
experts, which appear irrelevant to lexicographers addressing people without a specific knowledge 
of the subject (over-differentiation of ECOWN). 

3. Gaps: Some ECOWN synsets do not have any corresponding synset in IWN, which points out the 
gaps existing on the part of IWN. ECOWN synsets without a corresponding IWN synset often 
refer to technical notions and in these cases the hypernyms of the synsets in question will probably 
have a corresponding synset in IWN and so connections can be established. It may also happen, 
however, that an ECOWN synset without a corresponding IWN synset has no hypernym with 
corresponding IWN synsets. What can be established in this case is a relation of hyponymy 
between the synset in question and an IWN synset with a more generic meaning. The synset 
{riparto}^ecoWN (“allotment”), for instance, does not have a corresponding synset in IWN and could 
be connected to a more general synset, such as {divisione, partizione, ripartizione, 
frazionamento}^IWN (“division”). 

5.2. Plug-in Approach 

The plug-in approach is based on the idea that the specialized wordnet can be plugged into the generic 
wordnet, resulting in a richer resource. There are two basic requirements for a plug-in integration: 

• Precedence criteria: in the construction of the integrated wordnet the expert’s point of view is 
given absolute precedence as far as domain specific information is concerned. As for generic and 
linguistic information, on the other hand, IWN synsets and relations are maintained. 

• Terminological coverage: in the integrated wordnet all the terminal nodes of the specialized 
wordnet must be reachable, which guarantees that no part of the expert knowledge is omitted. 

The whole apparatus to realize an integrated wordnet is based on the use of three plug-in relations 
(PLUG_SYNONYMY, PLUG_NEAR_SYNONYMY and PLUG_HYPONYMY) that connect ECOWN synsets 
to corresponding IWN synsets, and on the use of eclipsing procedures that shadow certain synsets, 
either to avoid inconsistencies, or as a secondary effect of a plug-in relation. 6 

PLUG_SYNONYMY is used to establish connections between IWN and ECOWN whenever overlapping 
synsets are found. The main effect of establishing a relation of PLUG_SYNONYMY between {a}^IWN and 
{a1}^ecoWN is the creation of a new synset {a1}^plug which substitutes the connected synsets (i.e. the 
synsets directly involved in the relation) in the integrated wordnet. The new synset {a1}^plug gets its 
variants from ECOWN (i.e. those of {a1}^ecoWN), its hypernym from IWN (i.e. that of {a}^IWN) and its 
hyponyms from ECOWN (i.e. those of {a1}^ecoWN). 


6 As shown in the general architecture in Fig. 1, plug-in relations connect IWN and ECOWN without having any effect on 
the ILI. 







Figure 2a: {a}^IWN and {a1}^ecoWN are represented together with their hypernyms and hyponyms 

Figure 2b: Effects of a PLUG_SYNONYMY relation established between {a}^IWN and {a1}^ecoWN 

An exemplification of these effects is provided in figure 2b, where the integrated wordnet contains the 
new plug-in synset {a1}^plug resulting from the creation of a PLUG_SYNONYMY relation between {a}^IWN 
and {a1}^ecoWN in figure 2a, together with its hypernym and hyponyms. 

As far as secondary effects are concerned, we can observe that the hyponyms of {a}^IWN (i.e. {b}^IWN, 
{c}^IWN and {d}^IWN) and the hypernym of {a1}^ecoWN (i.e. {y1}^ecoWN) are not visible in the integrated 
wordnet, which means that they are eclipsed. 
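
The effects of PLUG_SYNONYMY can be summarized in a minimal sketch. The dictionary layout, the function name, and the names x (for the IWN hypernym) and e1, f1, g1 (for the ECOWN hyponyms) are assumptions made for illustration, not the data model of the actual implementation.

```python
# Toy synsets: a is an IWN synset, a1 the overlapping ECOWN synset.
IWN = {
    "a": {"variants": ["a"], "hypernym": "x", "hyponyms": ["b", "c", "d"]},
}
ECOWN = {
    "a1": {"variants": ["a1"], "hypernym": "y1", "hyponyms": ["e1", "f1", "g1"]},
}


def plug_synonymy(iwn_synset, ecown_synset):
    """Build the plug-in synset and list the synsets eclipsed as a side effect."""
    plug = {
        "variants": list(ecown_synset["variants"]),   # from ECOWN
        "hypernym": iwn_synset["hypernym"],           # from IWN
        "hyponyms": list(ecown_synset["hyponyms"]),   # from ECOWN
    }
    # Secondary effect: the IWN hyponyms and the ECOWN hypernym disappear.
    eclipsed = set(iwn_synset["hyponyms"]) | {ecown_synset["hypernym"]}
    return plug, eclipsed


plug, eclipsed = plug_synonymy(IWN["a"], ECOWN["a1"])
```

The sketch records only the directly eclipsed synsets named in the text; whether anything below them is also hidden is not modelled here.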

PLUG_NEAR_SYNONYMY is used in case of synset over-differentiation. From the point of view of the 
effects, there is no difference between connections established through PLUG_NEAR_SYNONYMY or 
PLUG_SYNONYMY. As already said, a new plug-in synset is created, which gets its hypernym from 
IWN and its hyponyms and variants from ECOWN. 

PLUG_HYPONYMY is used to establish connections between an IWN synset and an ECOWN synset 
with a more specific meaning when no corresponding synset is provided in the generic database, i.e. 
whenever a gap is found in IWN. 

The main effect of establishing a relation of PLUG_HYPONYMY between {p}^IWN and {q1}^ecoWN is the 
creation of two new plug-in synsets, {p}^plug in substitution of {p}^IWN and {q1}^plug in substitution of 
{q1}^ecoWN. The new synset {p}^plug gets its variants and its hypernym from IWN (i.e. those of {p}^IWN), 
while its hyponyms include {q1}^plug in addition to the hyponyms of {p}^IWN; the new synset {q1}^plug 
gets its variants and hyponyms from ECOWN (i.e. those of {q1}^ecoWN), while its hypernym is {p}^plug. 

An exemplification of these effects is provided in figure 3b, where the integrated wordnet contains the 
new synsets {p}^plug and {q1}^plug resulting from the creation of a PLUG_HYPONYMY relation between 
the synsets {p}^IWN and {q1}^ecoWN in figure 3a, together with their hypernyms and hyponyms. In 
particular, it should be noted that the establishment of a PLUG_HYPONYMY relation between {p}^IWN 
and {q1}^ecoWN has the effect of eclipsing the hypernym of {q1}^ecoWN (i.e. {y1}^ecoWN). 
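
A corresponding sketch for PLUG_HYPONYMY follows. As before, the dictionary layout and function name are hypothetical, and the neighbours of p and q1 (w, r, s, t1, u1) are invented placeholders.

```python
def plug_hyponymy(p_name, p, q_name, q):
    """Create the two plug-in synsets for an IWN synset p and an ECOWN synset q."""
    p_plug_name = p_name + "^plug"
    q_plug_name = q_name + "^plug"
    p_plug = {
        "variants": list(p["variants"]),                 # from IWN
        "hypernym": p["hypernym"],                       # from IWN
        "hyponyms": list(p["hyponyms"]) + [q_plug_name]  # IWN hyponyms plus q^plug
    }
    q_plug = {
        "variants": list(q["variants"]),   # from ECOWN
        "hypernym": p_plug_name,           # attached under p^plug
        "hyponyms": list(q["hyponyms"]),   # from ECOWN
    }
    eclipsed = {q["hypernym"]}             # the ECOWN hypernym is no longer visible
    return {p_plug_name: p_plug, q_plug_name: q_plug}, eclipsed


# Placeholder data: only p, q1 and y1 are names used in the text.
p = {"variants": ["p"], "hypernym": "w", "hyponyms": ["r", "s"]}
q1 = {"variants": ["q1"], "hypernym": "y1", "hyponyms": ["t1", "u1"]}
plugged, eclipsed_by_hyponymy = plug_hyponymy("p", p, "q1", q1)
```

Unlike PLUG_SYNONYMY, the IWN hyponyms of p stay visible here: the specialized synset is added next to them rather than replacing them.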


Figure 3a: {p}^IWN and {q1}^ecoWN are represented together with their hypernyms and hyponyms 

Figure 3b: Effects of a PLUG_HYPONYMY relation established between {p}^IWN and {q1}^ecoWN 







In order to establish a plug-in relation between two synsets it is necessary, first of all, that the two 
synsets are both visible, which means, in other words, that it is impossible to establish a relation 
between two synsets if one of them has already been eclipsed as an effect of a plug-in relation previously 
established. Secondly, it is only possible to establish a plug-in relation if it does not eclipse a plug-in 
relation previously established, i.e. a synset which has already been connected to another synset 
through a plug-in relation. 
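
The two preconditions can be expressed as a simple check. This is a sketch under assumptions: the function name and the use of plain sets to track eclipsed and already connected synsets are illustrative choices, not a description of the actual tool.

```python
def can_establish(iwn_synset, ecown_synset, would_eclipse, eclipsed, connected):
    """Check the two preconditions for a new plug-in relation.

    would_eclipse: synsets the new relation would hide as a side effect
    eclipsed:      synsets already hidden by earlier relations
    connected:     synsets already involved in an earlier plug-in relation
    """
    # 1. Both synsets must still be visible.
    if iwn_synset in eclipsed or ecown_synset in eclipsed:
        return False
    # 2. The new relation must not eclipse an already connected synset.
    if any(synset in connected for synset in would_eclipse):
        return False
    return True
```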


As already said, eclipsing occurs as a secondary effect of plug-in relations, but it is also an 
independent procedure used to prevent pairs of synsets that overlap semantically, but are placed 
inconsistently in the taxonomies, from being included in the integrated wordnet. 

5.3. Integration Procedure 

The plug-in approach described in the previous section has been realized by means of a 
semi-automatic procedure consisting of four main steps: 

1. Basic synsets identification. The domain expert identifies a preliminary set of informative synsets 
(“basic synsets”) of ECOWN. These synsets represent concepts that are highly representative of 
the domain and are also typically present in the generic wordnet. In addition, it is required that 
basic synsets are disjoint among each other and that all terminal nodes have at least one basic 
synset in their ancestor list. 

2. Alignment. Each basic synset is aligned to the most similar synset provided in IWN, on the basis 
of its structural and lexical properties. Then a plug-in relation is established between each pair of 
corresponding synsets, thus resulting in a candidate plug-in configuration. 

3. Merging. For each candidate plug-in configuration, an integration algorithm reconstructs the 
corresponding portion of the integrated wordnet. If the integration algorithm detects no 
inconsistency, the next candidate plug-in configuration is considered, otherwise step 4 is called. 

4. Resolution of inconsistencies. An inconsistency occurs when the implementation of a candidate 
plug-in configuration is in contrast with an already realized plug-in configuration. In this case the 
domain expert has to decide which configuration has the priority and consequently modify the 
other configuration, which will be passed again to step 2 of the procedure. 
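
The two requirements placed on basic synsets in step 1 lend themselves to an automatic check, sketched below. The sketch makes two assumptions that the text does not state: a terminal node that is itself a basic synset counts as covered, and "disjoint" is read as "no basic synset is an ancestor of another". The toy hierarchy and all names are invented.

```python
def ancestors(node, hypernym):
    """Return the hypernym chain of a node, nearest ancestor first."""
    chain = []
    while node in hypernym:
        node = hypernym[node]
        chain.append(node)
    return chain


def check_basic_synsets(hypernym, basic):
    """Return (uncovered terminal nodes, basic synsets nested under another basic synset)."""
    basic = set(basic)
    parents = set(hypernym.values())
    terminals = (set(hypernym) | parents) - parents
    uncovered = {
        t for t in terminals
        if not (basic & set([t] + ancestors(t, hypernym)))
    }
    nested = {b for b in basic if basic & set(ancestors(b, hypernym))}
    return uncovered, nested


# Invented ECOWN fragment as a child -> hypernym table.
ECOWN_HYPERNYM = {"m1": "top", "n1": "top", "m2": "m1", "m3": "m1", "n2": "n1"}
```

With this fragment, the choice {m1, n1} passes both checks, while {m1, m2} leaves n2 uncovered and reports m2 as nested under m1.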

The creation of the integrated wordnet required 269 plug-in relations (92 PLUG_SYNONYMY, 36 
PLUG_NEAR_SYNONYMY and 141 PLUG_HYPONYMY relations), resulting in 4662 ECOWN synsets 
connected to IWN, which means that each relation connects 17.3 synsets on average. Among these 
4662 ECOWN synsets, 577 substitute the corresponding IWN synsets, while the remaining 4085 
synsets are more specific concepts that were properly added. 
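
The figures reported above are mutually consistent, as the following short check shows. The variable names are illustrative; the numbers are those given in the text.

```python
relations = {
    "PLUG_SYNONYMY": 92,
    "PLUG_NEAR_SYNONYMY": 36,
    "PLUG_HYPONYMY": 141,
}
total_relations = sum(relations.values())      # 92 + 36 + 141

connected_synsets = 4662
average_per_relation = connected_synsets / total_relations

substituting = 577   # ECOWN synsets replacing an IWN synset
added = 4085         # more specific ECOWN synsets added below IWN

print(total_relations, round(average_per_relation, 1), substituting + added)
```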

The integration of IWN and ECOWN through the plug-in approach makes it possible to access IWN and 
ECOWN either independently or in conjunction with each other. 


6. Conclusions 

IWN has been used as reference lexicon for the Italian Syntactic-Semantic Treebank and for the 
annotation of the lexical sample in the SENSEVAL-2 initiative. A future necessary step, in view of 
having a more complete Italian semantic lexicon, will be to link the IWN resource and the PAROLE / 
SIMPLE one. Only in this way will future users benefit from the complementary information types 
contained in the two resources and suited to applications of different nature. This step will also 
provide a means for a sort of cross-evaluation of both resources. 

References 

Alonge, A., Calzolari N., Vossen P., Bloksma L., Castellon I., Marti T., Peters W. (1998) The 
Linguistic Design Of the EuroWordNet Database. In: “Special Issue on EuroWordNet”. Computers 
and the Humanities, Ide N., Greenstein D., Vossen P., eds., Vol. 32, Nos. 2-3 1998, 91-115. 





Chierchia, G. and McConnell-Ginet S. (1990) An Introduction to Semantics. MIT Press, Cambridge, 
MA. 

Cruse, D. A. (1986) Lexical Semantics. Cambridge University Press, Cambridge. 

Fellbaum, C. (1998) ed., WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA. 

Fellbaum C., Gross D., Miller K.J. (1993) Adjectives in WordNet, Five Papers on WordNet. 

Hovy, E. H. (1998) Combining and Standardizing Large-Scale, Practical Ontologies for Machine 
Translation and Other Uses. In: Proceedings of the 1st International Conference on Language 
Resources and Evaluation (LREC-’98). Granada, Spain. 

Lakoff, G. (1966) Stative Adjectives and Verbs in English. Computation Laboratory, Harvard 
University Report No. NSF-17. 

Lyons, J. (1977) Semantics. Cambridge University Press, London. 

Magnini, B and Cavaglia, G. (2000) Integrating Subject Field Codes into WordNet. In: Proceedings of 
the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, 
Greece, pp. 1413-1418. 

Magnini B. and Speranza, M. (2001) Integrating Generic and Specialized Wordnets. In: Proceedings 
of the Euroconference RANLP ’2001, 149-153. 

Peters, I., Peters W. and Gaizauskas R. (1999) The Representation of Adjectives in SIMPLE. Ms. 

Pustejovsky, J. (1995) The Generative Lexicon. MIT Press, Cambridge, MA. 

Quirk, R., Greenbaum S., Leech G., Svartvik J. (1985) A Comprehensive Grammar of the English 
Language. Longman, London. 

Roventini, A., A. Alonge, F. Bertagna, N. Calzolari, J. Cancila, R. Marinelli, A. Zampolli, B. 
Magnini, C. Girardi, M. Speranza. (forthcoming) ItalWordNet: building a large semantic database for 
the automatic treatment of Italian, Linguistica Computazionale, Giardini, Pisa. 

Sanfilippo, A., Calzolari N., Ananiadou S., Gaizauskas R., Saint-Dizier P., Vossen P. (1999) eds., 
Preliminary Recommendations on Lexical Semantic Encoding, EAGLES LE3-4244 Final Report. 


Adriana Roventini, Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. 
Cataldo, Ghezzano 56010 (PI) - ITALY, adriana@ilc.pi.cnr.it 

Antonietta Alonge, Sezione di Linguistica, Facoltà di Lettere e Filosofia, Università di Perugia, Piazza Morlacchi 
11, Perugia 06100 - ITALY, antoalonge@libero.it 

Francesca Bertagna, Consorzio Pisa Ricerche, Via S. Maria 40, Pisa 56100 - ITALY, F.Bertagna@ilc.pi.cnr.it 

Nicoletta Calzolari, Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. 
Cataldo, Ghezzano 56010 (PI) - ITALY, glottolo@ilc.pi.cnr.it 

Rita Marinelli, Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. 
Cataldo, Ghezzano 56010 (PI) - ITALY, marinell@ilc.pi.cnr.it 

Bernardo Magnini, ITC-Irst, Istituto per la Ricerca Scientifica e Tecnologica, Via Sommarive 18, I-38050 Povo 
(TN) - ITALY, magnini@itc.it 

Manuela Speranza, ITC-Irst, Istituto per la Ricerca Scientifica e Tecnologica, Via Sommarive 18, I-38050 Povo 
(TN) - ITALY, manspera@itc.it 

Antonio Zampolli, Istituto di Linguistica Computazionale, CNR, Area della Ricerca di Pisa, Via Alfieri 1, Loc. S. 
Cataldo, Ghezzano 56010 (PI) - ITALY 





BALKANET: A Multilingual Semantic Network for Balkan Languages 


Stamou Sofia, Oflazer Kemal 
Pala Karel, Christodoulakis Dimitris 
Cristea Dan, Tufis Dan 
Koeva Svetla, Totkov George 
Dutoit Dominique, Grigoriadou Maria 


Abstract 

BalkaNet aims at building a multilingual lexical database consisting of WordNets in several 
Central and Eastern European languages. Even though it will be built in a similar way to 
EuroWordNet, new features will be implemented ranging from structuring the Inter-Lingual- 
Index to ensure linking of conceptual equivalencies across WordNets to the development of 
an inter-networked WordNet Management so that each partner retains full responsibility and 
independence of his local WordNet whereas at the same time they will be able to view other 
WordNets and check their compatibility. 

1 Introduction 

BalkaNet is a funded project (IST-2000-29388) that aims at building a multilingual 
lexical database consisting of WordNets in Central and Eastern European languages. 
Each monolingual WordNet will be structured along the same lines as EuroWordNet 
(EWN), (Vossen, 1998) i.e., synonyms are grouped in synsets, which in their turn are 
related by means of basic semantic relations such as hyponymy, meronymy, 
antonymy, etc. Equivalence relations between synsets in different languages will be 
made explicit in the so-called Inter-Lingual-Index (ILI) adopted from EWN. ILI is an 
unstructured collection of concepts with the only purpose to provide an efficient 
mapping across languages. However, ILI will be modified in order to reflect the 
lexicalization patterns of Balkan languages and will be structured to enable efficient 
mapping of senses in the BalkaNet database. BalkaNet aims at developing a 
multilingual resource representing semantic relations among basic concepts of the 
following Balkan languages: Greek, Turkish, Romanian, Bulgarian, Czech and 
Serbian. BalkaNet includes semantic relations existing in each of the above 
languages, as language internal relations, as well as among them, as equivalence 
relations to the ILI. BalkaNet will as much as possible be built from available lexical 
resources so that it will be possible to combine information from independently 
created resources, making the final database more consistent and reliable while 
keeping at the same time the richness and diversity of the vocabularies of the 
languages involved. The main resources of information are going to be the individual 
monolingual WordNets that have already been developed or are currently under 
development for most of the participant languages. Where a monolingual WordNet is 
not available, dictionaries, thesauri or corpora of the respective languages will be used 
for the terminology extraction. For the development of BalkaNet a merge model 
approach will be followed: the WordNets will be built from 
independently developed resources and then linked to the most equivalent concepts in 
the ILI. We aim at a total set of 15,000 comparable synsets in each language, 
corresponding to roughly 30,000 literals, covering the generic vocabulary of the 
involved languages. The Part-Of-Speech (POS) distribution will be 65% nouns, 25% 
verbs, 5% adjectives and 5% adverbs. In addition, monolingual WordNets developed 
from scratch within the framework of the project will comprise approximately 8,000 
synsets whereas the number of synsets that will be added in already existing ones will 
be determined at a later stage. Finally, in order to keep compatibility with EWN the 
Language Independent Module, namely the Top-Concept Ontology, will be 
maintained along with the ILI records. One differentiation from EWN concerns the 
structuring of the ILI. The motivation behind structuring the ILI originates from 
various problems related to the mapping of senses in EWN. More specifically, 
because of high level of sense differentiation in the ILI there is a danger that 
conceptual equivalencies across WordNets are not linked to exactly the same sense of 
the English equivalent but instead connected to distinct ILI concepts reflecting 
different senses of the same word. In order to account for these diverging mappings 
from local WordNets onto ILI concepts, domain labels are going to be included in the 
ILI and the latter will be structured on the basis of the top ontology so that terms 
linked to the ILI correspond to the same conceptual domain even if they are not exact 
translational equivalents. In addition, a structured ILI would mean a grouping of the 
ILI concepts that belong to the same conceptual domain enabling thus a preliminary 
clustering of terms and as a consequence a preliminary clustering of documents 
indexed on the basis of the ILI. The project started a few months ago, thus no new 
synsets have been developed so far. Members of the consortium are in the process of 
setting the requirements, specifying the methodology to be followed for the 
development of the WordNet Management System, the structuring of the ILI and
developing tools for processing the monolingual lexical resources. In addition, the 
Base Concepts and the Top Ontology of EWN along with their internal relations are 
being carefully examined and checked against Balkan lexical resources in order to be 
enriched with Local Base Concepts and conclude on their applicability to BalkaNet. 
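
The effect of domain labels on the ILI can be illustrated with a minimal sketch. The record identifiers, glosses, domain labels and equivalence links below are invented for illustration only; the project's actual ILI format is not specified in this paper. The point shown is that local synsets linked to different senses of the same English word still fall into one cluster when those senses carry the same domain label.

```python
from collections import defaultdict

# Hypothetical ILI records: identifier -> (English gloss, domain label).
ILI = {
    "ili-001": ("bank (financial institution)", "finance"),
    "ili-002": ("bank (building of a financial institution)", "finance"),
    "ili-003": ("bank (sloping land beside water)", "geography"),
}

# Hypothetical equivalence links: (language, literal) -> ILI identifier.
LINKS = {
    ("el", "τράπεζα"): "ili-001",
    ("tr", "banka"): "ili-002",
    ("cs", "břeh"): "ili-003",
}


def cluster_by_domain(links, ili):
    """Group local synsets by the domain label of the ILI record they link to."""
    clusters = defaultdict(list)
    for local_synset, ili_id in links.items():
        _gloss, domain = ili[ili_id]
        clusters[domain].append(local_synset)
    return dict(clusters)


clusters = cluster_by_domain(LINKS, ILI)
```

In this sketch the Greek and Turkish entries land in the same "finance" cluster although they are linked to two distinct ILI records, which is the kind of diverging mapping the structured ILI is meant to absorb.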

2 Implementing the BalkaNet through a WordNet Management System 

The main differentiation between BalkaNet and EWN relies on the reusability and 
openness of the tools and software. More specifically, EWN has been constructed by 
using the Polaris tool (Bloksma, 1996), which we feel has a few drawbacks. First and
foremost it is a commercial stand-alone tool designed solely for WordNet 
maintenance that cannot be easily adapted to a new application and runs only in 
Microsoft Windows. For BalkaNet a (inter) networked tool will be developed to help 
partners coordinate their work online. Although it would be technically possible, we
do not want to create a fully Internet-based Web Polaris tool, since the Web cannot 
yet deliver a full-blown graphical user interface, and this would unnecessarily restrict 
local editing of WordNets. However, keeping all the benefits of the Web, such as 
distributed work environment, concurrent access to the data and multiple views of the 
data will be achieved through the WMS. Thus, we intend to develop a WMS that will 
allow the local tools to retrieve the required information. However, since the Internet 
is not always reliable the offline operation of local tools will be the primary mode of 
the WMS whereas the online one will be an extra facility. This way, a WMS that
supports both online and offline integration with local tools, plus a good dedicated
online interface, is going to be a powerful federated platform for coordinated
development of the monolingual WordNets while at the same time the construction of 
a multilingual database will be feasible. So far, EWN shares the same concept of a 
multilingual synset in the ILI. New records can only be added at the tail of the file and
are maintained by a central authority that issues periodical releases of a new ILI
record replacing the previous one. The WMS will provide a more flexible reference
scheme that enables local WordNets to keep references to the ILI even while the latter
is significantly restructured. The benefit behind using the WMS is that project
developers will be tightly linked with other WordNets and valuable suggestions for
new terminology fields will be facilitated. The WMS will be as open to the user as 
possible since it will be fairly easy for the users to develop and add their own 
components to the system. This can be accomplished by either encapsulating in the 
system capabilities for “plugging in” other applications, or by deploying the system 
under a free source license. This way, users will be able to use the same platform for 
their work and keep at the same time the data compatible. A new browser (editor) 
developed for the BalkaNet project will be able to work with WordNet files written in 
XML and it will also employ client-server architecture. The above tools will be 
developed on the Linux platform and the results will be widely available. The central
infrastructure of the WMS is going to be a federated database along with necessary 
communication protocols and Linux-based tools, which will run locally and provide 
central services. Summarizing, the (inter) networked WMS is going to be a platform- 
independent tool that will enable development of monolingual WordNets and their 
linking into a central database. 
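
The paper does not say how the more flexible reference scheme will be realised. One possible mechanism, sketched here purely as an assumption, is to leave a redirect behind whenever an ILI record is retired, so that references held by local WordNets can still be resolved after restructuring. The class name, method names and record identifiers are hypothetical.

```python
class IliRegistry:
    """Toy ILI store in which merged records leave a redirect behind."""

    def __init__(self):
        self.records = {}    # live record id -> gloss
        self.redirects = {}  # retired record id -> surviving record id

    def add(self, record_id, gloss):
        self.records[record_id] = gloss

    def merge(self, retired_id, surviving_id):
        """Retire one record in favour of another, keeping a redirect."""
        del self.records[retired_id]
        self.redirects[retired_id] = surviving_id

    def resolve(self, record_id):
        """Follow redirects until a live record id is reached."""
        while record_id in self.redirects:
            record_id = self.redirects[record_id]
        if record_id not in self.records:
            raise KeyError(record_id)
        return record_id


registry = IliRegistry()
registry.add("ili-10", "bank (institution)")
registry.add("ili-11", "bank (building)")
registry.merge("ili-11", "ili-10")  # restructuring step: two senses collapsed
```

A local WordNet that still stores "ili-11" would obtain "ili-10" from `registry.resolve("ili-11")` without having to be edited itself.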

Conclusions 

A central multilingual database with WordNets for a set of Central and Eastern 
European languages along with a WordNet Management System will be developed. 
Furthermore, an adjustment of BalkaNet to the EWN semantic network will be attempted
so as to extend the latter and make cross-language information retrieval efficient for 
the less-studied Balkan languages. Finally, with the implementation of BalkaNet it 
will be possible to trace and explore relationships among Romance and Balkan 
languages. 

Acknowledgements

This research was supported by the EC in the framework of the BalkaNet project, Ref.No. 
IST-2000-29388. We wish to thank all partners of the project for their valuable contribution 
and Dr. Piek Vossen and Prof. Christiane Fellbaum for their support. 

References 

Bloksma L., Diez-Orzas, Vossen P. (1996) The User Requirements and Functional
Specification of the EuroWordNet project, EWN-deliverable D.001, LE-4003

Vossen P. (1998) A Multilingual Database with Lexical Networks, Kluwer Academic 
Publishers, Dordrecht 


S. Stamou {stamou@cti.gr} and D. Christodoulakis {dxri@cti.gr} can be reached at
Databases Laboratory, Computer Engineering & Informatics Department, Patras
University, Greece. K. Oflazer {oflazer@sabanciuniv.edu} can be reached at Human
Language Technology Laboratory, Sabanci University, Orhanli, Tuzla, Istanbul, Turkey. K.
Pala {pala@fi.muni.cz} can be reached at Faculty of Informatics, Masaryk University, Brno,
Czech Republic. D. Cristea {dcristea@infoiasi.ro} can be reached at Faculty of Informatics,
University Alexandru Ioan Cuza, Iasi, Romania. D. Tufis {tufis@racai.ro} can be reached at
Centre for Advanced Research in Machine Learning, NLP and Conceptual Modeling,
Academia Romana, Bucharest, Romania. S. Koeva {svetla@ibl.bas.bg} can be reached at
Bulgarian Academy of Sciences, Institute for Bulgarian Language, Sofia, Bulgaria. G. Totkov
{totkov@ulcc.uni-plovdiv.bg} can be reached at Computer Science Department, University of
Plovdiv, Plovdiv, Bulgaria. D. Dutoit {memodata@wanadoo.fr} can be reached at Memodata
Natural Language Department, Caen, France and M. Grigoriadou {gregor@di.uoa.gr} at
Department of Informatics & Telecommunications, Athens University, Greece.





An Ontology and a Semantic Network for Danish Time Adverbs 
- based on the SIMPLE Lexicon Model 


Sanni Nimb 


Abstract 

The aim of a recently initiated Ph.D. project on Danish adverbs is to give a semantic lexical
description of Danish lexical adverbs, in order to extend a Danish computational lexicon - the
SIMPLE lexicon - with this word category. The Danish SIMPLE lexicon contains encodings of
approx. 10,000 word senses, performed on the basis of a unified, ontology-based semantic model
representing an extended qualia structure. In this paper we describe the classification of approx. 120
lexical time adverbs into different subtypes, and the establishment of a corresponding subontology of
these adverbs according to the SIMPLE model. It is shown how some of the adverbs inherit 
information from several nodes in the ontology. Finally we discuss how the semantic relations and 
features used for the description of time and Aktionsart, which are already implemented in the 
SIMPLE lexicons, can be used as well in the lexical description of time adverbs, and how semantic 
relations between the different adverbs can be encoded in the lexical entry. 

Introduction 

The aim of a project initiated on the first of September this year at Center for Language Technology,
Copenhagen, is to give a formalised syntactic and semantic lexical description of Danish lexical
adverbs in order 1) to extend a Danish computational lexicon (the SIMPLE lexicon) with semantic
information, e.g. relations to other words, on lexical adverbs, and 2) to be able to supply another
Danish computational lexicon, the STO lexicon, of which the semantic descriptions are based on the
SIMPLE lexical data, with syntactic information on adverbs.

Lexical semantic information on time adverbs can be useful in many cases in language technology 
products, e.g. in advanced information retrieval systems where information on time is needed, and in 
machine translation programs. For instance, the time adverb is often decisive for the choice of verb
with respect to aspect and time in translation, e.g. when translating from Germanic languages into
Romance languages.

The STO lexicon is in preparation at the moment. Apart from being based on the results and lexical 
data from the Danish SIMPLE lexicon, it is also based on another former EU-project on 
computational lexicography, namely the LE-PAROLE lexicon, containing information on 
morphology and syntax. In the LE-PAROLE and SIMPLE lexicons the word classes of nouns, verbs 
and adjectives are well described and well covered, but adverbs are not treated in detail; this is
especially the case for the SIMPLE lexicon. 

At the moment there are also plans to link the Swedish, the Danish and the Norwegian SIMPLE 
lexicon, which is under development. A second aim of the project is therefore to be able to create a 
more complete wordnet between these three languages, also covering adverbs and not only nouns, 
verbs and adjectives. 

We will in this paper propose a subontology covering a restricted group of adverbs, namely time
adverbs, and show how a set of semantic relations from the SIMPLE project can be used also in a
formal description of this group of lemmas.

1.1 The SIMPLE Lexicon 

The aim of the EU-project SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) 
was to provide harmonised semantic lexicons for Natural Language Processing for 12 of the
European languages. The language specific encodings in the lexicons are performed on the basis of a
unified, ontology-based semantic model representing an extended qualia structure.

One of the fundamental assumptions behind the SIMPLE model is that word senses differ in terms of 
their internal complexity and that this complexity can be described on the basis of an ontology 
established along different dimensions [LENCI ET AL. 2000]. Some word senses can be described by
means of simple types, which means that they inherit their information from only one mother node in
the ontology; others are more complex and thus inherit information from several mother nodes
following the principle of orthogonal inheritance (see footnote 39). These types are called unified
types. The multiple dimensions of meaning are represented in SIMPLE by means of an extended qualia
structure model based on [PUSTEJOVSKY 1995] encompassing a set of semantic relations such as is_a,
used_for, part_of, has_as_parts, is_the_result_of etc. for each qualia role (see also [ALONGE ET AL.
1998] for the use of similar semantic relations in EuroWordNet).
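
The difference between simple and unified types can be sketched in a few lines. This is an illustrative simplification, not the SIMPLE implementation: the node names, partitions and features are invented, and orthogonal inheritance is approximated by allowing at most one mother node per partition (qualia role).

```python
# Hypothetical ontology nodes: name -> (partition, features contributed).
# Each partition stands for one meaning dimension (one qualia role).
NODES = {
    "Concrete_entity": ("formal", {"is_a": "concrete_entity"}),
    "Artifact": ("agentive", {"created_by": "human"}),
    "Food": ("telic", {"used_for": "eating"}),
    "Instrument": ("telic", {"used_for": "performing_a_task"}),
}


def unify(mother_nodes, nodes=NODES):
    """Collect the features of several mother nodes (a 'unified type').

    Raises ValueError when two mother nodes belong to the same partition,
    since a feature may take its value from only one node per partition.
    """
    used_partitions = {}
    features = {}
    for name in mother_nodes:
        partition, node_features = nodes[name]
        if partition in used_partitions:
            raise ValueError(
                f"{name} and {used_partitions[partition]} share partition {partition!r}"
            )
        used_partitions[partition] = name
        features.update(node_features)
    return features


# A simple type has one mother node; a unified type has several.
simple = unify(["Concrete_entity"])
unified = unify(["Concrete_entity", "Artifact", "Food"])
```

Calling `unify(["Food", "Instrument"])` raises an error in this sketch, because both nodes would supply a value for the same dimension.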

1.2. A Classification of Danish Lexical Time Adverbs 

For the word category of lexical adverbs, building an ontology is not as straightforward as for the
other main word classes, nor is the establishment of a set of semantic relations that can be used in a
formal description. E.g. WordNet 1.6 contains mostly adverbs derived from adjectives via -ly
affixation [FELLBAUM, 1998].

In this paper we will focus on a group of lexical adverbs, namely those with a time dimension, in
order to examine to what extent it is possible in the first place to build a unified, ontology-based
semantic model for a restricted group of about 120 words. The result, though, can easily be applied to
other adverbials with a time sense, and will therefore in fact also cover a large group of lexicalised
multiword entities.

The adverbs that we will focus on were selected from a list of lemma candidates from Den Danske 
Ordbog (The Danish Dictionary), a national Danish Dictionary project of which the first volumes are
published in 2002. The list was based on both corpus inspection and references to several dictionaries
and contained approx. 1250 lexicalised adverbs. Examining these manually, about 120 were assigned 
a sense which had something to do with time. Following [Klein, 1994] in order to test his categories 
of time adverbs in the first place, the 120 adverbs were initially roughly categorised in the following 
groups: 

1) Positional temporal adverbs. They specify points in time in relation to other points in time, which 
are supposed to be given in context. The point in time related to can be either deictic (the time of 
utterance) or anaphoric (a point in time given somewhere in the linguistic context). Examples are 
i går (yesterday), senere (later).

2) Temporal adverbs of frequency. They indicate the frequency of temporal entities, like time spans
or possible situations which obtain at these time spans. Examples are ofte (often), sommetider
(sometimes), altid (always), sjældent (rarely).

3) Temporal adverbs of duration. They specify the duration of temporal entities, like time spans 
and/or situations obtaining at these time spans. Examples are i dagevis (for days), siden (since), 
endnu (still). 

4) Temporal adverbs describing inherent temporal properties of a situation. Examples are hurtigt 
(quickly), langsomt (slowly). 

5) Temporal adverbs which can indicate the position of a situation in a series of situations. Examples
are først (to begin with), allersidst (at the very end).


39 By ‘orthogonal inheritance’ we understand multiple inheritance with the restriction that a feature can only inherit its value
from one mother node from the same partition. Thus, in SIMPLE each meaning dimension (each qualia role) establishes its 
own partition. 





6) A group of quite different adverbs which fit none of the other five classes. Here [Klein, 1994]
mentions the English adverbs still, already and again. 

During the classification, an adverb would be marked if it did not clearly belong to a certain
group, or if it had special characteristics. E.g. in the case of type 3 adverbs, we marked those adverbs
which did not simply denote an unlimited or unbounded duration (as længe (long time)), but a
duration which had a left and/or a right boundary (referring to a period beginning and/or ending at a
certain time - e.g. siden (since), which indicates a period beginning at a certain time in the past but
which has not ended yet). In the case of positional temporal adverbs - adverbs indicating a point in
time - we chose to distinguish between deictically related adverbs and anaphorically related adverbs,
and whether the point in time was future, past or present.

In this way extra subcategories were created. Also the adverbs belonging to group 6 were placed in
different subcategories. 

We ended up with the following main categories and respective subcategories (related to X = 
anaphoric related; related to now = deictic related) : 

Main category: Frequency, subcategories: 1) Frequency precise, 2) Repetition, 3) Position in a series,
4) Always/never adverbs.

Main category: Duration, subcategories: 1) Limited terminated duration, 2) Duration from past till
now or other moment - not terminated, 3) Duration from now or other moment and into the future,
4) Unlimited duration related to now, and 5) Limited duration related to X.

Subcategory 1 also has the subcategory Always/never adverbs.

Main category: Point in time, past or future, with the subcategories: 1) Presence, related to other time
span, 2) Past, related to other time span, 3) Future, related to other time span.

1) has the subcategories 1a) Presence, related to now, and 1b) Presence, related to other time span +
modal.

2) has the subcategories 2a) Past, related to now, and 2b) Past, related to other time span + modal.

3) has the subcategory 3a) Future, related to now.

Main category: Inherent temporal properties of event, with the subcategories: 1) Speed of event, 2)
Speed of start of event, 3) Progress of event.

It is interesting that some of the adverbs share characteristics from two nodes in the ontology. This
is the case for a group of adverbs which are not easy to categorise either as clearly duration adverbs or
as clearly point-in-time adverbs: nutildags (nowadays), for tiden (these days), nu (now), i dag (today), i
år (this year). In the ontology proposed we have therefore marked a relation between the two nodes
Duration, Unlimited duration related to now, and Point in time, Presence related to now (marked
with a broken line in fig. 1).

Also the group of adverbs having an always/never sense share characteristics from two main
categories, namely both Frequency and Duration.

One special group of time adverbs even shares characteristics with a completely different kind of
adverbs, namely adverbs which are characterised by expressing the speaker's attitude to what is said
(disjuncts, which are adverbs like desværre (unfortunately) and sikkert (probably)). This is the case
for the time adverbs endelig (at last), omsider (at last), allerede (already) and pludselig (suddenly),
where the speaker's expectations about the time at which the event happens are expressed. We have
also marked this relation with a broken line in the subontology in fig. 1.
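
The way an adverb such as nu (now) draws on two branches of the subontology can be made concrete with a small sketch. The parent links follow the category list above; the root label "Time adverb" and the representation as Python dictionaries are assumptions made for illustration, not part of the SIMPLE encoding.

```python
# Parent links for a fragment of the proposed subontology.
# "Time adverb" as the root label is an assumption for this sketch.
PARENT = {
    "Duration": "Time adverb",
    "Point in time": "Time adverb",
    "Unlimited duration related to now": "Duration",
    "Presence related to X": "Point in time",
    "Presence related to now": "Presence related to X",
}

# Adverbs joined by a broken line in fig. 1 are listed under both nodes.
MEMBERSHIP = {
    "nu": ["Unlimited duration related to now", "Presence related to now"],
    "nutildags": ["Unlimited duration related to now", "Presence related to now"],
    "straks": ["Presence related to X"],
}


def path_to_root(node):
    """Return the node followed by all of its ancestors, nearest first."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def inherited_nodes(adverb):
    """Return every ontology node an adverb inherits information from."""
    nodes = set()
    for node in MEMBERSHIP[adverb]:
        nodes.update(path_to_root(node))
    return nodes
```

With these definitions `inherited_nodes("nu")` contains both "Duration" and "Point in time", whereas `inherited_nodes("straks")` contains only the point-in-time branch.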





1.3. A Subontology for Time adverbs 


Frequency (tit (often), vanligvis (usually))
    Frequency precise: månedlig (monthly)
    Repetition: igen (again), på ny (again)
    Position in a series: først (first), til sidst (at last)
    Always/never

Duration (længe (for a long time), efterhånden (as time goes/went on))
    Limited duration: i ugevis (for weeks)
        Always/never
    Dur. past-not termin.: siden (since), endnu (yet), hidtil (till now)
    Limited dur. related to X: sålænge, imens (in the meantime)
    Dur. from now/moment + fut.: fremover (from now on), foreløbig (for the time being)
    Unlimited duration related to now: nutildags (nowadays), for tiden (these days)

Point in time, past, presence or future (engang (once), nu (now), sent (late), tidligt (early))
    Presence, related to X: straks (immediately), samtidig (at the same time)
        Presence related to now: nu (now), i dag (today), i år (this year)
        Presence related to X + modal: pludselig (suddenly)
    Past, related to X: da (then), tidligere (earlier), forinden (before), dengang (at that time)
        Past, related to now: i morges (this morning), i går (yesterday), fornylig (recently)
        Past, related to X + modal: endelig, omsider (finally), allerede (already)
    Future related to X: derpå (afterwards), bagefter (afterwards)
        Future related to now: forude (in the future), i morgen (tomorrow)

Inherent temporal properties of event
    Speed: langsomt (slowly), hurtigt (quickly)
    Speed of start: hovedkulds (precipitate), fluks (straightway)
    Progress of event: nonstop, uafbrudt (incessantly)

Modal adverbs (express speaker's evaluation): desværre (unfortunately)

Broken-line relations: Unlimited duration related to now - Presence related to now;
Modal adverbs - Past, related to X + modal; Modal adverbs - Presence related to X + modal.


Figure 1: The proposed subontology for Danish time adverbs (X=anaphoric point in time): 





The idea of building this kind of ontology for the time adverbs is to be able to assign to each adverb 
in the computational lexicon the ontology subtype to which it belongs, and thereby assign the 
characteristics it therefore automatically inherits, and then afterwards add to the lexical entry those 
different semantic relations characterising the adverb more specifically. 

In the SIMPLE ontology model for nouns, specifications are already given for nouns referring to
temporal expressions. The abstract ‘Time’ ontology type is intended for this kind of noun and gives
the possibility of assigning constitutive qualia information like ‘iterative’ and ‘punctual’ to the nouns,
and of assigning semantic relations like ‘is_a’ (hyperonym), has_as_part (e.g. ‘month’ has_as_part
‘day’), is_a_part_of (e.g. ‘day’ is_a_part_of ‘month’), successor_of (‘Tuesday’ successor_of
‘Monday’). Also semantic relations like synonyms and antonyms can be expressed in the SIMPLE
model. These different kinds of information are also relevant in the case of the time senses of adverbs.
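
The relations named above can be pictured as labelled triples. The following sketch is only an illustration under our own assumptions: the triple store, the function name and the automatic recording of the inverse relation are not taken from the SIMPLE specification.

```python
# has_as_part and is_a_part_of are treated as each other's inverse here;
# whether SIMPLE stores both directions explicitly is not stated in the text.
INVERSE = {"has_as_part": "is_a_part_of", "is_a_part_of": "has_as_part"}


def add_relation(store, source, relation, target):
    """Record a relation triple and, where one is defined, its inverse."""
    store.add((source, relation, target))
    inverse = INVERSE.get(relation)
    if inverse is not None:
        store.add((target, inverse, source))


relations = set()
add_relation(relations, "month", "has_as_part", "day")
add_relation(relations, "Tuesday", "successor_of", "Monday")
```

After these two calls the store holds three triples: the two that were entered and the derived ('day', 'is_a_part_of', 'month').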



Figure 2: Part of the SIMPLE ontology: main types, the Abstract Entity types with the TIME subtype, and the
Event types: State, Process and Transition.

The description of time adverbs in SIMPLE must also be integrated with the part of the SIMPLE 
ontology describing events, since time adverbs and events share some of the same kind of semantic 
characteristics, namely those concerning boundaries or telicity, and those concerning tense [Klein,
1994]. E.g. time adverbs can interact with the Aktionsart of the verb in a sentence, with the
consequence that the aspectual sense changes, and certain time adverbs tend to combine more easily
with e.g. process verbs than others. Three event subtypes in the SIMPLE ontology - namely State,
Process and Transition, inspired by [Pustejovsky, 1995] (see fig. 2) - describe the part of semantic
information concerning Aktionsart, and a semantic unit describing an event is always assigned one of 
these three subtypes. 


Following the SIMPLE encoding principles for the description of Aktionsart and of time entities, the
entries of the adverbs hidtil (till now) and i dag (today) could, as a first attempt, be composed in the 
following way (see fig.3): 


Semantic Unit: hidtil (till now)
Definition: Angiver en tidsperiode som strækker sig fra et tidspunkt i fortiden til (tekstens)
    nu, og som ikke nødvendigvis er afsluttet
    (states a time period stretching from a point in the past till the present (of the text),
    and which isn't necessarily terminated)
Corpus example: Angriberen har <hidtil> nægtet at underskrive kontrakten (Berlingske Korpus,
    99) (the attacker has till now refused to sign the contract)
Semantic type: Time Adverb
Unification Path: Duration | Duration Past not terminated
Semantic Class: Adverb
Formal quale: is_a = tidsperiode (time period)
Constitutive Quale: Iterative=yes
    Process=yes
    State=yes
    Transition=no
Synonym: Synonym = hidindtil
Antonym:


Semantic Unit: i dag (today)
Definition: Angiver et tidspunkt inden for samme dag som det er (i tekstens) nu
    (states a point of time within the same day as it is now (in the text))
Corpus example: Statsminister Poul Nyrup Rasmussen holder kl. 18.00 i <dag> sin nytårstale
    (Prime minister Poul Nyrup Rasmussen delivers his new year's speech at 6 p.m.
    today)
Semantic type: Time Adverb
Unification Path: Presence related to now | Presence related to X | Point in time
Semantic Class: Time Adverb
Formal quale: is_a = tidspunkt (point of time)
Constitutive Quale: Iterative=underspecified, punctual=yes
    Has_as_part = imorges (this morning)
    Has_as_part = i_aften (tonight) etc.
    Process=yes
    State=yes
    Transition=yes


Figure 3: The entries of hidtil (till now) and i dag (today). 


Word relations between the different time adverbs in the lexicon are in these two entries established 
by the semantic relations ‘Synonym’, ‘Antonym’ and ‘Has_as_part’. 
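
As a rough illustration of how such entries could be held in a program, the two entries of fig. 3 can be written as records. The field names mirror the figure, but the `SemanticUnit` class, its layout and the `related` helper are hypothetical and do not reproduce the actual SIMPLE data format.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticUnit:
    """Hypothetical record mirroring the fields shown in fig. 3."""
    lemma: str
    semantic_type: str
    unification_path: list
    formal: dict
    constitutive: dict
    relations: list = field(default_factory=list)  # (relation name, target lemma)


hidtil = SemanticUnit(
    lemma="hidtil",
    semantic_type="Time Adverb",
    unification_path=["Duration", "Duration Past not terminated"],
    formal={"is_a": "tidsperiode"},
    constitutive={"Iterative": "yes", "Process": "yes",
                  "State": "yes", "Transition": "no"},
    relations=[("Synonym", "hidindtil")],
)

i_dag = SemanticUnit(
    lemma="i dag",
    semantic_type="Time Adverb",
    unification_path=["Presence related to now", "Presence related to X",
                      "Point in time"],
    formal={"is_a": "tidspunkt"},
    constitutive={"Iterative": "underspecified", "punctual": "yes",
                  "Process": "yes", "State": "yes", "Transition": "yes"},
    relations=[("Has_as_part", "imorges"), ("Has_as_part", "i_aften")],
)


def related(unit, relation):
    """Return the lemmas linked to a unit by the given relation name."""
    return [target for name, target in unit.relations if name == relation]
```

Following the relation targets from entry to entry is what turns the individual entries into a network of adverbs.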

Conclusions 

In this paper we have proposed a classification and an ontology for Danish time adverbs, as well as a 
method of encoding them in the Danish SIMPLE lexicon using the already established semantic 
descriptions in the SIMPLE model. The semantic relations encoded in the lexical entries establish a
word net between the different adverbs. 

The further plan of our project is to carry out more detailed investigations on how the different time 
adverbs behave, for instance regarding their interaction with the different kinds of Aktionsart. The 
aim of these investigations is the establishment of a larger set of semantic relations and semantic 
features well suited for the description of adverbs within the SIMPLE model, as well as the actual 
extension of the lexicon with adverb entries. 

References 

Alonge, A., N. Calzolari, P. Vossen, L. Bloksma, I. Castellon, M.M. Marti, W. Peters. (1998) The
Linguistic Design of the EuroWordNet Database, In P. Vossen (ed.) EuroWordNet, A Multilingual
Database with Lexical Semantic Networks. Kluwer Academic Press, Dordrecht, Boston, London.

Fellbaum, Christiane. (1998) A Semantic Network of English: The Mother of All WordNets, In P.
Vossen (ed.) EuroWordNet, A Multilingual Database with Lexical Semantic Networks. Kluwer
Academic Press, Dordrecht, Boston, London.

Klein, Wolfgang. (1994) Time in Language, Routledge.

Lenci, A., F. Busa, N. Ruimy, E. Gola, M. Monachini, N. Calzolari, A. Zampolli. (2000) Linguistic
Specifications Del. D2.1, unpublished SIMPLE report.

Pustejovsky, J. (1995) The Generative Lexicon, Cambridge, MA, The MIT Press.

Sanni Nimb, Center for Sprogteknologi, Njalsgade 80, DK-2300, Denmark, sanni@cst.dk



Adjectives in WordNet-Type Thesaurus: Estonian Experience 


Heili Orav


Abstract


There has long been a need for computer thesauri in Estonian lexicography. By 1997 it
was clear that besides morphological and syntactical analysis a lexical database based on word
semantics was needed. Compilation of the Estonian Wordnet started in 1997 and the work is still in
progress. The existing Estonian wordnet contains nouns, verbs and some adjectives. The present paper
concentrates on adjectives. The reason: there is no common solution for the treatment of
adjectives in the context of research on practical compiling of thesauri.

1. Introduction 

Much more has been written in the current linguistic theories about verbs and nouns than on their 
modifiers - adjectives. Outside of the “mainstream” of contemporary linguistic research there exists a 
sizeable body of work on adjectives, typically on their properties in languages other than English. And 
there are theoretical approaches which concentrate on adjectives (Dixon, 1982; Raskin, Nirenburg 
1995), but which are hard to integrate into WordNet type research. 

One goal of the present paper is to point out some important semantic features of adjectives in the 
context of research on practical compiling of thesauri, especially of Estonian wordnet (EstWN). 

2. Wordnet-Type Thesauri 

Finding semantic differences between parts of speech is actually one of the outcomes of building 
various thesauri, which represent a major type of semantic databases. The first and also the most well 
known is WordNet. 

In 1998 Estonia joined the EuroWordNet project. According to the tasks of this project all eight
participants (Dutch, Spanish, Italian, English, Czech, German, French, and Estonian) compiled
language-specific databases of nouns and verbs - adjectives were excluded from this project. One
of the reasons for this was the lack of a common solution for the treatment of adjectives. Only after the
end of the EuroWordNet project were there some indications of how to deal with adjectives.

In the English WordNet the authors distinguish between two classes of adjectives (relational and 
descriptive), which form so called “synonym clusters” and these clusters have either direct or indirect 
antonyms. As a result, the treatment of adjectives in WordNet is a “flat structure” because the criteria
for distinguishing between adjectives are not at all clear, as can be seen by the following quote from 
G.A. Miller (1993): "The decision about which file an adjective is to be listed in is, in the end, 
pragmatic". 

Besides WordNet, there are some projects that have considered adjectives thoroughly - they
have worked out a detailed theoretical basis for the presentation of different word classes, including
adjectives. These projects are GermaNet (http://www.sfs.nphil.uni-tuebingen.de) and SIMPLE
(http://www.ub.es/gilcub/SIMPLE). In these projects adjectives are hierarchically divided according
to semantic fields.

Classifying adjectives according to semantic groups into hierarchical relationships is quite 
problematic. For example, the most difficult issue with regard to nouns’ hierarchical relationship is to
find the top hyperonyms. The same holds also in regard to finding top concepts to the most frequent 
verbs, both transitive and intransitive. As for adjectives, the situation is even more unclear, because 
there are no lexemes that would denote groups of adjectives. 



So it seems that each of these projects has brought out different types of problems implicit in the 
semantic nature of adjectives. 


3. Estonian WordNet 

At the present moment (November 2001) the general EstWN has about 300 adjectives created within the
framework of EuroWordNet. The work of recording adjectives is based on a frequency list of the most
frequent adjectives and adjectives associated with them.


3.1 . Frequencies of Adjectives 

The frequency list comes from the Corpus of the Estonian Literary Language (CELL) of the 1980s, and
the final number of adjectives from the corpus is around 10 000. To establish the real number of
occurrences of lexical units as adjectives we should use a corpus that is manually morphologically
disambiguated. For an inflectional language like Estonian, morphological analysis and morphological
disambiguation (because of frequent homonymy among word forms) is extremely important because
it can decrease ambiguity between parts of speech.

During this work the following issues came up, namely redundancy or deficiency of morphological
disambiguation of word forms in the frequency list. More precisely:

• Not all adjectives behave as pure adjectives do. It depends on morphological
disambiguation. For example: the first word of the frequency list, 'oma' (own), which occurs 2972
times in CELL, behaves more like a pronoun.

• The list also contains some high-frequency words which are specific to the Soviet period.
For example: riiklik (governmental), sotsialistlik (socialist) etc.

• The number of compounds in Estonian is indefinite. It is quite easy for a speaker of Estonian
to invent new compounds that are not in any dictionary, but nevertheless are easily
understood. In the frequency list there are a lot of such examples. For example
mikrolaineahjukindel (microwave safe). Hence the question: are they commonly used,
and should we add them to the lexicon?

• The list contains adjectives formed with the suffix -lik, appended to the stem of substantives. A
speaker of Estonian can use this ending freely and the number of words formed in this way may be
indefinite. The same holds for the suffix -line. For example loodus+lik (nature -> natural). But at
the same time there exist words (like harilik (usual)) which are not derived, since there is no
noun hari in common usage in the relevant sense.

• Productive suffixes in Estonian are also -kas and -jas. If these join to adjectives, they
form a new meaning, i.e. they weaken the initial meaning of the adjective. For example: punane (red)
> punakas (reddish), mõru > mõrkjas (bitterish), and hapu > hapukas (tart, sourish).

• Participles are one of the real problems in Estonian as well as in English. They can behave as
verbs and as adjectives depending on context. For example: nimetatud (named, appointed),
möödunud (past) etc. As adjectives they belong to the class of declinable words in Estonian
and it is possible to compare them.

As seen from the previous observations, there clearly exist problematic questions about base 
vocabulary and there is a need to reconsider the borders of meaning. 


3.2. Semantic Relationships of adjectives 

The next question is - how to link Estonian adjectives into semantic relationships? 

One disadvantage of using the WordNet method is that, like German relational adjectives,
Estonian relational adjectives often occur as compound adjectives (e.g. muusikainstrument (musical
instrument), aatompomm (atomic bomb)). It would also be reasonable to avoid using indirect
antonyms because otherwise artificial antonyms often have to be created (e.g. WordNet: pregnant ->
nonpregnant or angry -> not angry). At the same time this is an important relationship, which
considerably decreases polysemy of words. Exclusion of that kind of antonyms has to be compensated
with a new relationship that carries the same function.

In Estonian we should test all variants for taxonomy of adjectives. Is it useful to try to put words one 
by one into hierarchies and see what happens? 

For verification of hyponymy relation it is possible to use frames of lexical-semantic relations 
proposed by Cruse (1986): 

X is kind of Y or 

If it is X, then it is also Y. 

Here the author thinks that world knowledge runs through our thoughts horizontally, not vertically.
This holds especially for adjectives.

Several common adjectives make reference to certain nouns. For example, it seems wrong in Estonian
to ask “Oli see punane või teistmoodi värviline (adjective)?” (Was it red or coloured in some other way?).
Rather we ask “Oli see punane või mõnda teist värvi (noun)?” In the same way, different adjectives
indicate different properties of shape, taste, voice, age, psychological state etc. Lyons (1977)
terms such a relation between a noun and an adjective quasi-paradigmatic. He says that the lexica of all
languages could be arranged under a small number of lexemes with general meaning if the hierarchically
structured lexicon contains quasi-hyponymy in addition to hyponymy. The problem is that the constraints
on quasi-hyponymy have to be determined. In WordNet this relation is attribute; for example, the noun
weight has the attributes light and heavy. In EuroWordNet the relations between nouns and adjectives
are be_in_state and state_of. For example: {colour} be_in_state {coloured} or {colourless}. So it
seems more appropriate to continue working with such semantic relations, because they give more
information about the meaning of an adjective.
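
As an illustration only, the noun-adjective relation described above could be stored as a pair of inverse links. The sketch below is not the EuroWordNet database format: the dictionary layout and the function name add_relation are assumptions made for this example, and only the relation names and the {colour} example come from the text.

```python
from collections import defaultdict

# Hypothetical store: (source synset, relation name) -> set of target synsets.
relations = defaultdict(set)

# be_in_state and state_of are treated here as inverses of each other.
INVERSE = {"be_in_state": "state_of", "state_of": "be_in_state"}


def add_relation(source, relation, target):
    """Store a relation together with its inverse so both directions can be queried."""
    relations[(source, relation)].add(target)
    relations[(target, INVERSE[relation])].add(source)


# The example from the text: {colour} be_in_state {coloured} or {colourless}.
add_relation("colour", "be_in_state", "coloured")
add_relation("colour", "be_in_state", "colourless")

print(sorted(relations[("colour", "be_in_state")]))
print(sorted(relations[("coloured", "state_of")]))
```

Storing both directions lets a lookup start from either the noun or the adjective, which is the extra information about adjective meaning that the paragraph above argues for.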

4. Conclusions 

Adjectives have not been studied as extensively as nouns and verbs in traditional lexical
semantics, and yet they are semantically as complex, or even more so.

Therefore, in order to achieve a complete treatment of Estonian adjectives, thorough research should
be carried out.

The present paper did not offer a solution to the peculiarities of adjectives, but it pointed out some
difficulties that appeared during the practical compilation of wordnet-type thesauri.

References 

Adjectives in GermaNet. http://www.sfs.nphil.uni-tuebingen.de/lsd/Adi.html

Cruse, D.A. (1986) Lexical Semantics. Cambridge University Press.

Dixon, Robert M. W. (1982) Where Have All the Adjectives Gone? In: Robert M. W. Dixon, Where
Have All the Adjectives Gone? and Other Essays in Semantics and Syntax. Berlin-Amsterdam-New
York: Mouton.

Erelt, M. (Ed.) (1995) Eesti keele grammatika (EKG), Eesti TA EKI, Tallinn.

Gross D., Fellbaum C., Miller K. (1990) (revised 1993) Adjectives in WordNet. In: International
Journal of Lexicography 3 (4), ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps

Lyons J. (1977) Semantics 1, Cambridge University Press.

Miller G. A., Beckwith R., Fellbaum C., Gross D., Miller K. (1993) Five Papers on WordNet.
Technical Report, Cognitive Science Laboratory, Princeton University. Revised version.
ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps

Miller, K.J. (1998) Modifiers in WordNet. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical
Database. Cambridge, MA, The MIT Press.

Raskin, V. and Nirenburg, S. (1995) Lexical Semantics of Adjectives, a micro-theory of adjectival
meaning. MCCS report 95-288.

SIMPLE Final Guidelines.
http://www.ub.es/gilcub/SIMPLE/reports/simple/SIMPLE Fguidelines.rtf.zip


Heili Orav, Department of General Linguistics; University of Tartu, Tiigi 78 - 204, Tartu 50410, Estonia, 
horav@psych.ut.ee 



Estonian WordNet Benefits from Word Sense Disambiguation 

Neeme Kahusk
Kadri Vider 


Abstract 

The effect of a lexical resource on the Word Sense Disambiguation (WSD) task is bidirectional. The results
of WSD depend on the quality of the lexicon, and the lexicon can be improved in the process of WSD.

About 10 000 content words in texts were manually disambiguated according to Estonian WordNet
(EstWN) word senses. The main aim of the study, besides gaining experience in WSD, is to find out
how well the existing EstWN covers real language usage in texts.

1. Introduction 

We have been building the Estonian WordNet for four years already. Although it has fewer than ten
thousand synsets, it is time to evaluate the result. A good occasion for this is the
WSD task, as its result depends highly on the available lexicon.

Different corpora of Estonian have been one of the sources in building EstWN, and in working with
corpora the question constantly arises whether the concrete occurrences of a word represent
different meanings of that word or not.

2. Estonian WordNet 

Compilation of EstWN started in 1997 and the work is still in progress. The work was funded partly
by the Estonian Science Foundation and partly within the framework of the Estonian National Programme of
Language Technology. Before joining the EuroWordNet (EWN) project in 1998 we managed to create
about 900 synsets.

It is relevant to mention that other partners in the EWN project have created their wordnets automatically,
but we have preferred manual work. One reason for this is the absence of suitable electronic material;
on the other hand, manual work gives more exact results. Most of the lexical information comes from
monolingual explanatory and/or sense-distinguishing dictionaries or synonym dictionaries. There are
not many of them for Estonian; moreover, we cannot use machine-readable dictionaries.

EstWN is supposed to cover the Estonian base vocabulary first.

Except for the Base Concepts [Vossen et al. 1998], translation from Princeton WordNet (WN 1.5) was
avoided; the main emphasis was placed on frequency lists of word forms. The word frequency lists are
compiled on the basis of the Corpus of Estonian Literary Language (CELL). The materials of the
Corpus are also used to define the different meanings of a word, and quotations from the Corpus are
used as examples.

The existing Estonian wordnet contains nouns, verbs and some adjectives. The latest version of
EstWN contains 9524 synsets, 13364 words and 17076 senses. An overview of the present state of EstWN
is given in Table 1.



                            Noun          Verb          Adjective   Total
Synsets                     6469          2748          307         9524
No. of senses               [illegible]   5728          518         17076
Senses per synset           1.67          2.08          1.69        1.79
Lexical entries (lemmas)    9159          3786          419         13364
Senses per entry            1.18          1.51          1.24        1.28
Semantic relations          13984         5573          538         [illegible]
Relations per synset        2.16          [illegible]   1.75        2.11


Table 1: EstWN in numbers today 
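
The ratios in Table 1 can be cross-checked against the totals quoted in the running text. The short sketch below is an arithmetic check only; it assumes that the third column of the table holds adjectives and that the total of 17076 senses given in the text is correct, and the variable names are chosen for this example.

```python
# Figures taken from the running text and from the legible cells of Table 1.
# Assumption: the third (unlabelled) column of the table is the adjective column.
synsets = {"noun": 6469, "verb": 2748, "adjective": 307}
entries = {"noun": 9159, "verb": 3786, "adjective": 419}
senses = {"verb": 5728, "adjective": 518}
total_senses = 17076  # stated in the text

total_synsets = sum(synsets.values())
total_entries = sum(entries.values())

# If the stated total is right, the noun cell follows by subtraction.
noun_senses = total_senses - sum(senses.values())

print("total synsets:", total_synsets)
print("total entries:", total_entries)
print("implied noun senses:", noun_senses)
print("noun senses per synset:", round(noun_senses / synsets["noun"], 2))
print("noun senses per entry:", round(noun_senses / entries["noun"], 2))
print("verb senses per synset:", round(senses["verb"] / synsets["verb"], 2))
```

Under these assumptions the column sums agree with the totals in the text, and the implied number of noun senses reproduces the per-synset and per-entry ratios printed in the noun column.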


Concerning semantic relations, the main emphasis is placed on synonymy as the fundamental semantic
relation and on hyponymy/hyperonymy as the relation that builds semantic hierarchies, as in
EuroWordNet generally.

A more detailed description of EstWN is given in the Estonian part of the final document of
EuroWordNet [Vider et al. 1999].

3. Word Sense Disambiguation 

For several language technology applications, it is important to determine the sense in which each word is
used. The problem of semantic disambiguation is similar to morphological and syntactic
disambiguation, but it is hard to say what a word sense is, and this mostly depends on the goals of
disambiguation [Kilgarriff 1997].

In the papers written on this topic, it is possible to distinguish two main approaches:
• fully probabilistic methods, which presume a lemmatised text corpus of more than 100 000 tokens;
• lexicon-based methods.

We did not have any manually disambiguated texts, so we decided to disambiguate a reasonable
amount of them manually and build an automatic WSD system that would make use of the existing
wordnet.

4. Texts for Word Sense Tagging 

Lexical entries (literals) in EstWN are presented in the nominative singular form for nouns and in the
supine form for verbs. In real texts, words mostly appear in the full richness of their forms; there are
fourteen cases in Estonian. So we need morphological analysis in the first place.

The morphological analysis was made with ESTMORF. Lemma and word class are relevant to our
task.

On average 45% of the word forms in Estonian texts are morphologically ambiguous [Kaalep 1997].
The ambiguity can be greatly reduced by also applying the Estonian morphological disambiguator
[Kaalep and Vaino 1998] to the text before word sense disambiguation.

5. Manual Disambiguation 

Four linguists disambiguated nouns and verbs in the texts; each text was disambiguated by two
persons. The sense number was marked according to the sense number in EstWN.


If the word was missing from EstWN, “0” was marked as the sense number, and if the word was in
EstWN but lacked the appropriate sense, “+1” was marked.

If inconsistencies were found, they were discussed until agreement was reached. In about 28% of
cases the disambiguators had different opinions.
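
A disagreement figure of this kind can be computed by comparing the two sense tags assigned to each token. The following is a minimal sketch under the assumption that each annotator's output is a list of tags aligned token by token; the sample tags are invented for illustration and are not the study data.

```python
def disagreement_rate(tags_a, tags_b):
    """Share of tokens for which two annotators chose different sense tags."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        return 0.0
    differing = sum(1 for a, b in zip(tags_a, tags_b) if a != b)
    return differing / len(tags_a)


# Invented example: "0" marks a word missing from the lexicon,
# "+1" marks a word whose appropriate sense is missing.
annotator_a = ["2", "0", "+1", "1", "3"]
annotator_b = ["2", "0", "1", "1", "4"]

print(disagreement_rate(annotator_a, annotator_b))
```

The tokens on which the tags differ are the ones that would then be discussed until agreement is reached.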

One of the problems that the disambiguators ran into concerned the division of words into different senses
in EstWN. Because of the lexicographical sources we use, over-differentiation (word meaning marked as
too specific) and over-generalisation (word meaning marked as too general) may occur. Traditional
dictionary entries are centred on the word, and all syntactically or pragmatically specific meanings are
represented there, but semantically they are not always significant. At present the semantically richest verb
'käima' has 23 senses, the richest noun 'asi' has 11 senses and the richest adjective 'raske' has 8 senses in
EstWN.

5.1. Lexicon Coverage 

Not all senses found in EstWN are represented in the texts. The maximum number of senses per word found
in the texts is 13. This is more than the number of senses for that word in the lexicon (see Table 2), but we
must remember the “+1” mark that the disambiguators could use if they found that there were not enough
meanings in EstWN. Table 2 shows the top lemmas according to the number of senses in usage and in EstWN.



Word class    No of senses in text   Lemma         No of senses in lexicon
verb          13                     saama         12
verb          10                     [illegible]   [illegible]
noun          10                     asi           11
[illegible]   9                      [illegible]   [illegible]
verb          9                      käima         23
verb          7                      võtma         [illegible]
verb          7                      panema        [illegible]
verb          7                      nägema        [illegible]
verb          7                      minema        17
verb          7                      leidma        8


Table 2: Comparison of richest words in senses 


It would be best if all the words to be disambiguated were in the lexicon with all their possible meanings.
Evidently this presumption is not met.

The number of compounds in Estonian is indefinite, as in the Germanic languages. It is quite easy for a
writer to make up compounds that are not found in any dictionary but are still understandable to
readers. About 46% of the words that are not in EstWN are compounds. It is not reasonable to enter all
imaginable compounds into the wordnet. The distinction can be made by analysing the
components of a compound: if the components have a low frequency as independent words in corpora,
then the compounds they build up should be added to the lexicon.
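
The frequency criterion for compounds can be sketched as follows. This is an illustration, not the procedure actually used for EstWN: the function name, the threshold value and the reading that every component must be rare are assumptions, and the frequency counts are placeholders rather than corpus data.

```python
def should_add_compound(components, frequency, threshold=5):
    """Suggest adding a compound when all of its components are rare as independent words.

    `frequency` maps a word to its corpus frequency as an independent word;
    `threshold` is a hypothetical cut-off below which a word counts as rare.
    """
    return all(frequency.get(part, 0) < threshold for part in components)


# Placeholder counts, invented for the example.
frequency = {"part_a": 900, "part_b": 3, "part_c": 1}

print(should_add_compound(["part_b", "part_c"], frequency))  # both components rare
print(should_add_compound(["part_a", "part_b"], frequency))  # one component frequent
```

A compound whose components are frequent independent words is left out, on the assumption that its meaning can be worked out from the parts.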

Another remarkable class of words not in the lexicon is proper names: there are no proper names in
EstWN, but about 17.5% of the words with zero sense are proper names.

If we set aside phrasal verbs and some strange words that contain hyphens (about 7%), the remaining
29% make up a list of words to be added to EstWN. From the point of view of a thesaurus
builder this is the most valuable result of the WSD task. Still, the list has to be carefully inspected,
because it can contain domain-specific and rare words.

The senses missing from EstWN (tagged with “+1” in the texts) also give us valuable information about gaps
in the lexicon. Those 227 missing noun senses and 163 missing verb senses will be added to EstWN
first.

6. Automatic Disambiguation 

The main inspiration for our automatic WSD system semyhe is the conceptual density method of Agirre
and Rigau [Agirre and Rigau 1996]. They disambiguated English noun senses on the basis of the WordNet
hyponym/hypernym hierarchy, taking into consideration the distances between the nodes
corresponding to the word senses in the WordNet tree as well as the density of the tree.

As we also had a wordnet-type thesaurus with hyperonym-hyponym hierarchies, we decided to use this
method, and modified it to process verbs too.

6.1. Sense Disambiguation Algorithm 

Our main objective was to try to disambiguate automatically all nouns and verbs in the texts.

We apply exactly the same algorithm to both nouns and verbs. Nouns and verbs cannot be compared
with each other, since in terms of the hyponym/hypernym hierarchy they are located in different trees in
the thesaurus.

A window is shifted over the text, and as a word moves through the window its senses are compared
with the senses of the other words in the window. The context consists of either nouns or verbs,
depending on which part of speech is being disambiguated.

The basis of the comparison is the similarity between the senses, which is defined through the notion
of conceptual distance: the distance between the nodes corresponding to the senses in the EstWN tree.
The winners are the senses that minimize the total distance between the word senses in the window; all
the rest are removed from the list of candidates for the correct reading. semyhe leaves the word
ambiguous when more than one sense has an equal result. This usually happens when the
senses of the context words are located in different hierarchies and hence cannot be compared.
Currently there are 108 different top nodes in EstWN, 29 corresponding to nouns and 79 to verbs.
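
The window comparison described above can be illustrated with a small sketch. It is a simplification and not the semyhe implementation: distance is taken as plain path length in a hypernym tree, each candidate sense is scored against the closest sense of every other word in the window, and the node names, the function names and the preference for senses that are comparable with more context words are assumptions made for this example.

```python
def depths_to_top(node, parent):
    """Map each ancestor of `node` (including itself) to its distance from `node`."""
    depths, steps = {}, 0
    while node is not None:
        depths[node] = steps
        node = parent.get(node)
        steps += 1
    return depths


def conceptual_distance(a, b, parent):
    """Path length between two nodes, or None when they lie in different trees."""
    up_a = depths_to_top(a, parent)
    node, steps = b, 0
    while node is not None:
        if node in up_a:
            return up_a[node] + steps
        node = parent.get(node)
        steps += 1
    return None


def disambiguate(word, window, senses, parent):
    """Return the candidate senses of `word` that are closest to the window context."""
    scores = {}
    for sense in senses[word]:
        total, comparable = 0, 0
        for other in window:
            if other == word:
                continue
            found = [conceptual_distance(sense, s, parent) for s in senses[other]]
            found = [d for d in found if d is not None]
            if found:
                total += min(found)
                comparable += 1
        # Prefer senses comparable with more context words, then the smaller distance.
        scores[sense] = (-comparable, total)
    best = min(scores.values())
    # More than one sense in the result means the word stays ambiguous.
    return [s for s in senses[word] if scores[s] == best]


# Hypothetical toy hierarchy (child -> hypernym); not EstWN data.
parent = {
    "organisation": "entity",
    "landform": "entity",
    "bank_institution": "organisation",
    "credit_union": "organisation",
    "bank_river": "landform",
}
senses = {
    "bank": ["bank_institution", "bank_river"],
    "credit_union": ["credit_union"],
}

print(disambiguate("bank", ["bank", "credit_union"], senses, parent))
```

With an empty context, or with context senses that lie in other trees, all candidate senses receive the same score and the word is left ambiguous, as in the behaviour described above.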

In SENSEVAL-2, the Estonian WSD system semyhe achieved a precision and recall of 0.66, which is about
the same as the other competing WSD system for the Estonian task, JHU (recall and precision 0.67).

7. Problems and Discussion 

7.1. Language specific Lexicalisation Patterns 

Estonian, as a Finno-Ugric language, is different from the other languages in EuroWordNet.
Some specific lexicalisation patterns appeared when we tried to link Estonian synsets to the
Inter-Lingual-Index [Õim et al. 1998].

In the case of verbs there is a quite regular discrepancy between English and Estonian in the
expression of causativity. In Estonian, causative counterparts of noncausative verbs are regularly
lexicalized as different lexical entries, and the opposition is productive. For instance, to move (by
itself) is 'liikuma', and to move (something) is 'liigutama'.

Onomatopoetic-descriptive words are also more detailed in Estonian. One example: in Estonian a bear
'mõmiseb' and a lion 'möirgab'. In English they both roar.


7.2. Phrases and Multi-word Units 

Estonian is an inflectional language with free word order, which makes it complicated to identify all
phrases. The elements of a phrase can be scattered around the sentence in an unpredictable order.

However, it is known that expressions tend to contain frozen forms, including inflectional endings.
For example, one may not say "*Human Right" or "*Humans Right". "Human Rights" is the only
correct expression and should be added to thesauri in that form. Phrasal verbs like "ära maksma"
(to pay off) and idiomatic verbal expressions like "end tükkideks naerma" (to laugh oneself into
pieces) represent a different situation from the one described above: the verb part may inflect
freely, but the other word(s) are frozen forms. Thus, even when we have determined what counts as a phrase
or collocational multi-word unit, the question remains whether it is commonly used and whether we
should add it to the lexicon.

Multi-word expressions are included in EstWN if they form a conceptual unit and are commonly
used as lexical units.

7.3. Lessons of Disambiguation 

The modal senses of verbs are explicitly marked in the output of the morphological disambiguator.
When a verb is marked as modal, the senses that do not correspond to the modal use can be
rejected and the winning sense chosen from the remaining ones. For example, the verb 'saama' has
altogether 12 senses in the thesaurus, but only 2 of them correspond to the modal use of the word (either
can or may).

The syntactic structure could also help to reduce the number of possible senses to choose from. For
example, the verb 'olema', already mentioned above (see Table 2), has five more frequent senses:
(1) be, (2) exist, (3) stay, (4) occupy an area, (5) have.

The first sense is present in complementary clauses; senses 2, 3 and 4 appear in existential sentences
and the last one in possessive sentences. If information about the nature of the sentence were
present in the input text, it would certainly help the disambiguation process.
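
How such sentence-type information could narrow the candidate senses of 'olema' is sketched below. The sentence-type labels and the function are hypothetical; the text only states which senses occur in which kind of sentence, and no such annotation was available in the input.

```python
# Senses of 'olema' as numbered in the text.
OLEMA_SENSES = {1: "be", 2: "exist", 3: "stay", 4: "occupy an area", 5: "have"}

# Hypothetical sentence-type labels mapped to the senses the text associates with them.
SENSES_BY_SENTENCE_TYPE = {
    "complementary": {1},
    "existential": {2, 3, 4},
    "possessive": {5},
}


def filter_senses(candidates, sentence_type):
    """Keep only the candidate senses compatible with the given sentence type.

    An unknown sentence type leaves the candidate set unchanged.
    """
    allowed = SENSES_BY_SENTENCE_TYPE.get(sentence_type)
    if allowed is None:
        return set(candidates)
    return set(candidates) & allowed


print(sorted(filter_senses(OLEMA_SENSES, "existential")))
print(sorted(filter_senses(OLEMA_SENSES, "possessive")))
```

The same kind of filter could be applied to modal marking: when the morphological disambiguator marks a verb as modal, only its modal senses would be kept as candidates.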

Sometimes it is difficult to compute the similarity measure. When different sense nodes of one word have the
same parent node and are equally distant from the rest of the sense nodes, the similarity
measure for them will be equal.

This may indicate the need to revise a certain part of the wordnet hierarchy. For example, for a
translation system it is crucial that the senses of the word 'naine', which can stand for woman,
wife or generally female person, are fully disambiguated, although the senses seem to be very close.

In EstWN 'naine 3' (wife) and 'naine 1' (woman, adult female) are hyponyms of 'naine 2' (female
person), and every attempt at disambiguation according to conceptual distance would lead to 'naine 2'
(female person).
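
The effect can be reproduced on a toy hierarchy. In the sketch below only the three 'naine' senses and their hyponymy links come from the text; the parent node 'inimene' and the context node 'mees' are hypothetical placeholders added so that a distance can be computed, and distance is again taken as plain path length.

```python
def path_to_top(node, parent):
    """List of nodes from `node` up to the top of its tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def distance(a, b, parent):
    """Path length between two nodes through their closest shared ancestor."""
    path_a = path_to_top(a, parent)
    path_b = path_to_top(b, parent)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            return steps_a + path_b.index(node)
    return None


# 'naine 1' and 'naine 3' are hyponyms of 'naine 2' (from the text);
# 'inimene' and 'mees' are assumed nodes used only for this illustration.
parent = {
    "naine 1": "naine 2",
    "naine 3": "naine 2",
    "naine 2": "inimene",
    "mees": "inimene",
}

for sense in ("naine 1", "naine 2", "naine 3"):
    print(sense, distance(sense, "mees", parent))
```

For any context node outside the 'naine 2' subtree, the two hyponyms are always one step further away than their shared parent and tie with each other, which is why a pure distance measure cannot separate wife from woman.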

8. Conclusions 

The first attempt to disambiguate Estonian word senses has been fruitful for the builders of the
Estonian WordNet.

The WSD of corpus texts turned out to be a good way to add missing synsets and senses to our
wordnet. There were significant inconsistencies between the opinions of the people who disambiguated the
texts. This shows us the most problematic entries in EstWN and the need to reconsider the borders of
meaning of some concepts.


For an inflectional language like Estonian, morphological analysis is extremely important, and
morphological and semantic disambiguation can help each other.

References 

E. Agirre and G. Rigau. (1996) Word Sense Disambiguation using Conceptual Density. In COLING-96.

H.-J. Kaalep and T. Vaino. (1998) Kas vale meetodiga õiged tulemused? Statistikale tuginev eesti
keele morfoloogiline ühestamine. Keel ja Kirjandus, 1:30-36. In Estonian. English title: Getting
correct results with an incorrect method? Morphological disambiguation of Estonian using statistics.

H.-J. Kaalep. (1997) An Estonian morphological analyser and the impact of a corpus on its
development. Computers and the Humanities, 31:115-133.

A. Kilgarriff. (1997) I don't believe in word senses. Computers and the Humanities, 31(2):91-113.

H. Õim, K. Vider, L. Paldre, H. Orav, and K. Pala. (1998) Specification of Czech and Estonian
WNs. EuroWordNet (LE-8328) Deliverable: 2D003. University of Amsterdam.

K. Vider, L. Paldre, H. Orav, and H. Õim. (1999) The Estonian Wordnet. In C. Kunze, editor, Final
Wordnets for German, French, Estonian and Czech. EuroWordNet (LE-8328), Deliverable 2D014.

P. Vossen, C. Kunze, A. Wagner, D. Dutoit, K. Pala, P. Sevecek, K. Vider, L. Paldre, H. Orav, and
H. Õim. (1998) Revised Set of Common Base Concepts, EuroWordNet-2 (LE-8328), Deliverable
2D001. University of Amsterdam.


Neeme Kahusk, Department of General Linguistics, Tiigi 78-204, 50410 Tartu, Estonia, nkahusk@psych.ut.ee
Kadri Vider, Department of General Linguistics, Tiigi 78-204, 50410 Tartu, Estonia, kvider@psych.ut.ee




Methodological Issues in the Building of the Basque WordNet: 
Quantitative and Qualitative Analysis 


Eneko Agirre 
Olatz Ansa 
Xabier Arregi
Jose Mari Arriola 
Arantza Diaz de Ilarraza 
Eli Pociello 
Larraitz Uria 

Abstract 


This paper describes the methodology we have adopted to ensure the quality of the Basque WordNet 
in terms of coverage, correctness, completeness and adequacy. The Basque WordNet follows the 
EuroWordNet framework and, basically, it is produced using a semi-automatic method that links 
Basque words to the English WordNet. We have found that in order to ensure proper linguistic quality 
and avoid excessive English bias, a double manual pass on the automatically produced Basque synsets 
is desirable: a first concept-to-concept pass to ensure correctness of the Basque words linked to the 
synsets, and a word-to-word pass to ensure the completeness of the word senses linked to the words. 
By this method, we expect to combine quick progress (as allowed by a development based on the 
English WordNet) with quality (as provided by a development based on a native dictionary). We have 
completed the concept-to-concept review of the automatically produced links for the nominal 
concepts, and are currently performing the word-to-word review. 

INTRODUCTION 

This paper presents work on the Basque WordNet (BasqWN). Our team had an increasing need to
construct an extensive and complete Lexical Knowledge Base for Basque (LKBB). To this end, our
point of departure is the English WordNet developed at Princeton University (Fellbaum, 1998), which
is becoming established as a ‘de facto’ standard for lexical-semantic representation for English.
Taking this English WordNet as a reference, new WordNets have been built for other languages,
especially in the framework of the EuroWordNet project¹. EuroWordNet (EuroWN) basically
adds multilingual links across WordNets.

We present the current state of the Basque WordNet, but the paper focuses mainly on the methodology
that we chose to build it. The emphasis of this work is on two points:

• The need to check the data produced with external lexical resources for Basque.

• The need to perform a thorough quality check of the data produced, with special emphasis on
the necessity of concept-to-concept and word-to-word reviews.

One of the motivations of this paper is the lack of literature on methods to guarantee and assess the
correctness, completeness and adequacy of the information produced. EuroWordNet, for instance,
published an evaluation of the produced WordNets that consisted basically of a comparison of a
number of ratios (variants per concept, senses per entry, overlaps across wordnets) and an evaluation
of errors in cross-language equivalence relations (Vossen et al., 1998a; Vossen et al., 1998b).

The paper is organized as follows. We first present the overall principles for the design and
methodology, together with the current state of the Basque WordNet. Section 2 presents the automatic
generation and concept-to-concept review processes that led to the preliminary BasqWN 0.1 release.
Sections 3 and 4 review the quantitative and qualitative analysis of BasqWN 0.1. Section 5
summarizes the conclusions drawn from the analysis. Section 6 presents the method for the word-to-word
review. Finally, section 7 presents some conclusions and future work.


1 http://www.hum.uva.nl/~ewn

1. Design and Methodology 

The design of the Basque WordNet (BasqWN) follows that of EuroWN. In EuroWN each language
WordNet has its own relations and set of synsets or concepts². The link across the different language
WordNets is made using the Inter-Lingual Index (ILI), which is a list of concepts. Most of the
concepts in the ILI come from the English WordNet version 1.5. Words are linked to concepts. If we
take the concepts linked to one word, each link (i.e. each linked concept) expresses a word sense of
the source word. If we take the words linked to one concept, each link (i.e. each linked word)
corresponds to a variant of the concept. The variants of a given concept are synonyms of each other.
An example of the terminology used in the paper is the following:

• The words maverick and rebel are variants for the nominal concept/synset 06228559 glossed as 
someone who exhibits great independence in thought and action 

• Concept 06228559 above corresponds to the first sense for maverick , and the third sense for rebel. 
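
The terminology can be made concrete with a small sketch in which word-concept links are stored once and read from either side. The tuple layout and the function names are assumptions made for this illustration; only the concept number and the two sense numbers come from the example above.

```python
# Word-concept links as (word, sense number of that word, concept id).
links = [
    ("maverick", 1, "06228559"),
    ("rebel", 3, "06228559"),
]


def variants(concept, links):
    """Words linked to a concept; they are synonyms of each other."""
    return sorted(word for word, _, linked in links if linked == concept)


def word_senses(word, links):
    """Concepts linked to a word, keyed by the sense number of that word."""
    return {number: linked for w, number, linked in links if w == word}


print(variants("06228559", links))
print(word_senses("rebel", links))
```

Reading the links by concept gives the variants of a synset, and reading them by word gives the senses of that word; the two reviews discussed later in the paper inspect the same links from these two sides.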

The method to be used to build the Basque WordNet was a matter of debate in our research
group. We could either take a sense inventory and a concept hierarchy independent from WordNet
(derived from a dictionary or our own lexicographic work) and link it to the ILI, or take the English
WordNet 1.5 concepts and hierarchy and link Basque words to them³.

The advantage of the former approach is that we control the sense inventory and hierarchies, building
them according to our own criteria, at the cost of undertaking expensive lexicographic work and having to
devise a way to link the Basque WordNet to the ILI. The advantage of the second approach is that we
get the link to the ILI for free, and that most of the work is reduced to linking Basque words to the
ILI. We ultimately aimed to have the best of both approaches, but we decided to take the English
WordNet 1.5 as the starting point. In any case, monolingual Basque dictionaries will give us a
reference for the quality of the sense distinctions made, and we also plan to use the hierarchies and
lexico-semantic relations extracted from dictionaries to enrich or even transform the Basque WordNet.
In fact, we are studying a more complex representation for lexico-semantic relations, which could
make the Basque WordNet depart from the current EuroWordNet design.

From another point of view, the method to build the Basque WordNet was influenced by the following
general criteria: we want the Basque WordNet to have wide coverage and we also want it to be of high
quality. Coverage involves concepts, entries, parts of speech, word senses and synonyms. Quality can
be divided into correctness, completeness and adequacy:

1. Correctness of variants of a synset and word senses of a word. 

2. Completeness of variants of a synset and word senses of a word. 

3. Adequacy of the specificity level for variants in the synset (also with respect to English
equivalents), and for word senses.

We had an additional requirement on the Basque WordNet which conflicts with the above list in the 
short term: we needed fast development with limited resources. This led us to first put the stress on 
coverage, and leave quality enforcement for later. For the same reason, we adopted a semi-automatic 
procedure to build the nominal part (cf. section 2). 


2 In this paper concept and synset refer to the same thing. We are aware that in lexical resources other than WordNet,
concepts and synsets can be different things.

3 In EuroWordNet (Vossen et al. 1998a) the former approach is referred to as the merge model, and the latter as the expand
model.



The above considerations led us to devise a three-stage process for the development of the Basque
WordNet:

• First stage: development of the core Basque WordNet 
For each part of speech: 

1. Link all Base Concepts manually, checking quality. 

2. Automatically generate Basque equivalents for the English concepts. 

3. Concept-to-concept review of the equivalents generated in previous step. The goal is the fast 
development of the core WordNet for this part of speech. The stress is on coverage and speed. 
Linguists focus on correctness of variants in a synset, but also cover completeness of variants in 
the synset. 

4. Word-to-word review of word senses. The goal is twofold: ensure quality across word senses and 
try to cover main senses for most frequent/relevant words. The stress is on quality. Linguists focus 
on the correctness and completeness of word senses for a word. 

5. Check adequacy of specificity level.

• Extend coverage to all the vocabulary. 

• Second stage: map the Basque WordNet to a Basque monolingual dictionary. Design of an 
enriched representation and extraction of varied lexico-semantic representations from the 
dictionary. 

We want to stress the need for both the concept-to-concept and word-to-word reviews. Both work on
the same data (word-concept links) but from different, complementary angles.

1.1. Current state of the Basque WordNet 

At present, the Basque WordNet is in the following development state (see section 3 for figures on 
number of synsets and entries): 

• Verbs: Base Concepts and a core subset (similar to SI in EuroWN) have been manually done (step 
1 of first stage). 

• Nouns: Base Concepts have been manually done (step 1). The automatic method was executed,
and the concept-to-concept review completed (steps 2 and 3). At this stage, we released the
preliminary version, BasqWN 0.1, which is of similar size and characteristics to the final EuroWN
releases (Vossen et al., 2001). The word-to-word review (step 4) is being performed at present, as
well as the check on the specificity level.

Regarding the human effort involved, the concept-to-concept review (step 3) took 1,640 hours at 
approximately 16 concepts per hour. The word-to-word review has already taken 502 hours, at a rate 
of approximately 2 words per hour, which has involved reviewing around 16 concepts per hour. 
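
As a rough order-of-magnitude check, the hours and rates above can be multiplied out. The rates are stated as approximate, so the totals below are only indicative, and the variable names are chosen for this example.

```python
# Figures quoted in the text; the per-hour rates are approximate.
concept_review_hours = 1640
concepts_per_hour = 16
word_review_hours = 502
words_per_hour = 2

concepts_reviewed = concept_review_hours * concepts_per_hour
words_reviewed = word_review_hours * words_per_hour

print("concepts covered by the concept-to-concept review:", concepts_reviewed)
print("words covered so far by the word-to-word review:", words_reviewed)
```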

The Basque WordNet is implemented as a relational database with a web interface that allows 
browsing and editing (Benitez et al. 1998). Linguists have access to a set of integrated online lexical 
resources comprising the Spanish and English WordNets, a monolingual Basque dictionary (Sarasola, 
1996), a synonym dictionary (UZEI, 1999), two bilingual English-Basque dictionaries (Morris 1998; 
Aulestia & White, 1990), a bilingual Spanish-Basque dictionary (Elhuyar, 1998), and a terminology 
database (UZEI, 1987). The linguists also use the Oxford Spanish-English bilingual dictionary off-line 
(OUP, 1994). The project currently involves three persons in the supervision committee, one person 
for linguistic coordination, two persons for linguistic work and one person for computer support. 

2. Automatic Generation and Concept-to-Concept Review 


In order to help the linguists in their task, we automatically generated Basque noun concepts from
machine-readable versions of Basque-English bilingual dictionaries (Morris, 1998; Aulestia & White,
1990). All translation pairs in the dictionaries were extracted in the form English term - Basque term
(term here refers to a single word or multiword phrase). These pairs were combined with WordNet
synsets and the resulting combinations were analyzed following the class methods (Atserias et al.
1997). The algorithm produces triples of the form Basque word - synset - confidence ratio. The confidence
ratio is assigned depending on the results of the hand evaluation, which is organized around different
submethods. The pairs produced by class methods with a confidence ratio lower than 62% were
discarded. The next section presents the figures for this step.
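
The final filtering step can be sketched as follows. The class methods themselves are not reproduced here; the sketch only shows the shape of the triples and the 62% cut-off mentioned above. The type name, the field names, the synset identifiers and the confidence values are all invented for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """One automatically generated link: Basque word - synset - confidence ratio."""
    basque_word: str
    synset: str
    confidence: float


THRESHOLD = 0.62  # links with a lower confidence ratio are discarded


def keep_confident(candidates, threshold=THRESHOLD):
    """Return only the candidates whose confidence reaches the threshold."""
    return [c for c in candidates if c.confidence >= threshold]


# Invented candidates; the synset ids and scores are placeholders.
candidates = [
    Candidate("word_a", "synset_1", 0.91),
    Candidate("word_b", "synset_1", 0.62),
    Candidate("word_c", "synset_2", 0.40),
]

for kept in keep_confident(candidates):
    print(kept.basque_word, kept.synset, kept.confidence)
```

The surviving triples are the ones that would be passed on to the manual concept-to-concept review.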

The results of the previous process were validated by hand. The linguists reviewed the synsets that had
a Basque equivalent one by one, checking whether the Basque words were correctly assigned and
adding new words to the synonym set if needed. This process can be sped up if the linguist works through
related synsets, i.e. when the next synset to be treated is related to the previous one. In order to
facilitate the manual work, we used an interface linked to a database (Benitez et al., 1998) that
allowed for simultaneous updates and accesses. In order to offer the synsets following hyperonym
chains, we added an extra button that led to the next synset in the hyperonym chain that had not yet
been reviewed. In addition, it was possible to mark a synset as “dubious”. This was used when
further lexicographic analysis was needed; in that case, the linguists would meet with the linguistic
coordinator. The next section presents the figures for this step.
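
The behaviour of such a navigation button might look like the sketch below. It is not the interface code: the data layout, the function name and the choice to walk upwards from the current synset towards its hyperonyms are assumptions, and the synset names are placeholders.

```python
def next_undone(synset, hyperonym_of, reviewed):
    """Return the nearest hyperonym of `synset` that has not been reviewed yet.

    `hyperonym_of` maps a synset to its hyperonym; `reviewed` is the set of
    synsets already checked. Returns None when the chain is exhausted.
    """
    node = hyperonym_of.get(synset)
    while node is not None:
        if node not in reviewed:
            return node
        node = hyperonym_of.get(node)
    return None


# Placeholder chain, invented for the example.
hyperonym_of = {"calf": "young_mammal", "young_mammal": "animal", "animal": "organism"}
reviewed = {"calf", "young_mammal"}

print(next_undone("calf", hyperonym_of, reviewed))
```

Following the chain keeps the linguist within one semantic area, which is what makes the review of related synsets faster.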

We have to note that following the above procedure no new concepts are added to the ILI, i.e. we do 
not add concepts that are not already available in the English WordNet. 

3. Quantitative State of BasqWN 0.1 

Table 1 reviews the number of synsets, entries, etc. of the Basque WordNet compared to WordNet 1.5
and the EuroWordNet final release (Vossen et al. 2001). The first two rows show the number of Base
Concepts, which were manually done. For nouns in the Basque WordNet 0.1, the Nouns (auto) row
shows the figures as produced by the raw automatic algorithm, and the Nouns (man) row the figures
after the manual concept-to-concept review. The manual review reduced the number of entries to about
50% of the automatic figure, and the number of senses to about 15%. This high number of spurious
entries and senses is caused primarily by a high number of orthographic and dialectal variants that
were introduced by the older bilingual dictionary, which does not follow the standard rules adopted in
recent years.

There is an extra set of 3,000 synsets that are currently under review. This set comprises mainly
multiword terms with dubious lexicalization in Basque, i.e., the Basque equivalent is a definition
rather than a term:

maverick (unbranded range animal esp. a stray calf): markatu_gabeko_txahal
breakthrough (making an important discovery): garrantzi_handiko_aurkikuntza








                                  Synsets   Senses   Senses/synset   Entries       Senses/entry
Basque WordNet
  Nominal BC                          228   [illegible]  [illegible]  -            [illegible]
  Verbal BC                           792   -        -               -             -
Basque WordNet 0.1
  Nouns (auto)                      27641   291011   10.52           46164         6.3
  Nouns (man)                       23486    41107    1.75           22166         1.8
  Verbs (man)                        3240     9294    2.86            3155         2.95
WN 1.5
  Nouns                             60557   107484    1.77           87642         1.23
  Verbs                             11363    25768    2.27           [illegible]   1.75
Dutch WordNet
  Nouns                             34455    54428    1.58           [illegible]   [illegible]
  Verbs                              9040    14151    1.57           [illegible]   [illegible]
Spanish
  Nouns                             18577    41292    2.22           23216         1.78
  Verbs                             [illegible]
Italian WordNet
  Nouns                             30169    34552    [illegible]    [illegible]   1.39
  Verbs                              8796    12473    1.42            6607         1.89


Table 1: Figures for the Basque WN release 0.1 compared to WN 1.5 and the final EWN release.



The senses-per-entry figures are higher than those of WN 1.5 and most of the other WordNets, but
similar to those of the Spanish WordNet. This can be explained by the fact that the nouns and verbs
included are in general more polysemous. We also analyzed the distribution of the variants in each
synset and of the number of word senses per entry. From this analysis we found that many words had a
polysemy degree higher than 15, a figure that is not observed in the English WordNet. This fact is
analyzed in more detail in section 4.2.

All in all, the number of synsets and entries for the Basque WordNet 0.1 is comparable to those for the
WordNets produced in EuroWordNet, but lower than in the WN 1.5 release. The Basque WordNet covers
38% of the concepts in WN 1.5 and 100% of the entries in a Basque monolingual dictionary.
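
The ratios and coverage figures quoted above can be rechecked with a few lines of arithmetic. The sketch below uses the Nouns (man) and WN 1.5 Nouns figures from Table 1; the dictionary keys and function names are illustrative only.

```python
# Figures from Table 1: Basque WordNet 0.1 Nouns (man) and WN 1.5 Nouns.
basque_nouns = {"synsets": 23486, "senses": 41107, "entries": 22166}
wn15_nouns = {"synsets": 60557, "senses": 107484, "entries": 87642}

def senses_per_synset(row):
    return row["senses"] / row["synsets"]

def senses_per_entry(row):
    return row["senses"] / row["entries"]

def coverage(part, whole, key):
    """Fraction of the reference resource covered for the given column."""
    return part[key] / whole[key]

concept_coverage = coverage(basque_nouns, wn15_nouns, "synsets")
entry_coverage = coverage(basque_nouns, wn15_nouns, "entries")
```

With these inputs the concept coverage falls between 38% and 39% and the entry coverage between 25% and 26%, in line with the figures reported in the text.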

4. Qualitative Analysis of BasqWN 0.1 

We were not fully satisfied with the quantitative analysis and the results of the concept-to-concept
review. On the one hand, the quantitative analysis only shows the state of the coverage of concepts
and entries, as long as they are compared to reference figures from WordNet (concepts) and Basque
reference dictionaries (entries). It is rather difficult to assess the coverage of the number of word
senses and synonyms⁴, as these can only be compared to WordNet, but there are no reference figures
for the Basque WordNet itself. We think that the coverage of word senses and synonyms can be more
reliably estimated by measuring by hand the completeness of the word senses of a sample of words and
the variants for a sample of concepts.

On the other hand, the concept-to-concept review only enforces the correctness and completeness of 
the variants in the synset. As the stress of the first stage was on quickly producing a first version, 
correctness was more important than completeness, and we were not completely satisfied with the 
completeness of the variants. 

As already listed in section 1, these are the correctness, completeness and adequacy requirements that
were not covered by the quantitative analysis:

a. Correctness of word senses of a word. This is indirectly enforced by the concept-to-concept
review, but the linguist tends to focus on the list of variants itself. Additional cross-validation
could be needed.

b. Completeness of word senses of a word. This can only be enforced by comparison to lexical
resources listing sense inventories. This could be used to provide an estimate of the coverage for
word senses.

c. Completeness of variants of a concept. Further review could be necessary, which could provide an
estimate of the coverage for synonyms.

d. Adequacy of the specificity level for variants in a concept, i.e. all variants of a concept are of
the same specificity level.

e. Adequacy of the specificity level for word senses, i.e. granularity of word senses.

In order to assess points a, b and e, we performed a manual comparison and mapping of the word
senses given by BasqWN 0.1 with those of a monolingual dictionary and a bilingual dictionary. This
assessment is presented in the next subsection.

We have also manually checked the correctness and completeness (c) of the variants for a concept,
using a synonym dictionary for this purpose. The results were highly satisfactory, but we decided to
explicitly include the use of the synonym dictionary in all subsequent reviews and updates of the
Basque WordNet.


4 Coverage of synonyms is linked to the number of variants in each synset. As explained in section 1, we use the terms 
variant and synonym to name the words that lexicalize a synset. 




Subsection 4.2 presents some preliminary assessment of the adequacy of the specificity level for 
variants in a concept (d). 

4.1. Manual Mapping of Word Senses from BasqWN and Basque Dictionaries 

The sense partition of a Basque monolingual dictionary reflects a suitable native sense partition, and
need not be of the same granularity as that of WordNet. In principle, both sense partitions could even
be incompatible, in the sense that the mapping between them could be many-to-many.

We chose to use the Euskal Hiztegia (EH) dictionary (Sarasola, 1996), as it is a general purpose
monolingual dictionary, and it covers standard Basque. It contains 33,111 entries and 41,699 senses.
Besides, it is the dictionary used for the Basque task in the Senseval 2001 competition. One drawback
of this dictionary is that it mainly focuses on literary tradition, and it lacks many entries and word
senses which are more recent. For this reason, we decided to include also a bilingual Basque-English
dictionary (Morris, 1998). Moreover, if the linguist thought that some other word sense was missing,
he/she was allowed to include it.


The linguist was also required to leave out some senses from EH, particularly those tagged as nuances 
of other senses corresponding to rare usages. 


Word                     Number of   Senses added     Senses added       Senses added
                         senses      from bilingual   from monolingual   by linguist
                                     dictionary       dictionary

[illegible]                  9           0                3                  0
TENTSIO: tension             9           3                0                  1
[illegible]                 22           3                4                  0
GAI: material, matter       16           0                0                  1
EGUN: day                    9           1                1                  0
ELIZA: church                4           0                1                  0
MASA: mass                   4           1                0                  0
UR: water                    5           2                0                  0
KANAL: canal                 8           1                0                  0
KOROA: crown                 9           0                1                  0
IBILBIDE: course, way        8           3                0                  0
KANTU: song                  3           0                1                  0
KAPITAIN: captain            5           0                1                  0
LANTEGI: factory             5           1                0                  0
ENPLEGU: job                 1           1                1                  1
Total                      117          13               13                  3


Table 2: Word senses missing from BasqWN 0.1


Table 2 shows the result of the comparison of the sense inventories. The first column lists the Basque
words alongside some of their possible translations. The second column shows the number of word
senses per word in BasqWN. The third, fourth and fifth columns list the number of word senses added
according to the bilingual dictionary, the monolingual dictionary and the linguist’s introspection.

All in all, both bilingual and monolingual dictionaries contribute equally to the new senses. An
average of 1.9 new senses are added for each word, which makes an average of 0.24 new senses for
each existing sense. This gives an idea of the completeness of the word senses for words. All word
senses were found to be correct. These figures can be extrapolated to estimate that the coverage of
word senses for the entries currently in BasqWN is around 80%.
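
The estimate can be retraced from the totals in Table 2. The sketch below is only a worked check of the arithmetic; the variable names are illustrative, and the word count of 15 is the number of rows in the table.

```python
# Totals from the last row of Table 2.
existing_senses = 117
added = {"bilingual": 13, "monolingual": 13, "linguist": 3}
sampled_words = 15  # rows in Table 2

total_added = sum(added.values())
new_senses_per_word = total_added / sampled_words
new_senses_per_existing_sense = total_added / existing_senses
# Share of the final sense inventory that was already present in BasqWN 0.1.
estimated_coverage = existing_senses / (existing_senses + total_added)
```

Here 29 senses are added to 117 existing ones, so roughly 117 out of 146 senses (about 80%) were already present, which is the coverage figure given above.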

Regarding the mapping between the word senses of BasqWN and the monolingual dictionary, most of
the time it was one-to-one or many-to-one. The granularity of the word senses in BasqWN is much
finer. We have even found a sense in the Basque monolingual dictionary that accounts for 9 BasqWN
word senses. We have not found many-to-many mappings. We have to note that some of the word
senses in BasqWN were not present in the Basque dictionary, as illustrated by the following word
senses in BasqWN:

CROWN (koroa) an English coin worth 5 shillings

CAP (koroa) an artificial crown for a tooth

DATE (egun) the particular year that an event occurred

This analysis has also uncovered some other difficulties such as the incompatibility concerning parts
of speech. For instance, one of the senses of the word herri can only be used in compounding and the
English equivalent would be that of the adjective public. In order to map both senses, we need to find
a nominal synset equivalent to public (or introduce a new nominal synset with this meaning) and
introduce a cross-PoS equivalence relation between the nominal and adjectival concepts. This kind of
phenomenon cannot be detected in the concept-to-concept review.

4.2. Adequacy of the Specificity Level of Variants in Synsets 

As already mentioned in the quantitative analysis, we found out that some words had an unusually
high number of word senses. Quick hand inspection showed that for some concepts the variants were
of heterogeneous specificity. In fact, we suspected that some words were placed in too many concepts.
An example follows:

The concept religious glossed as a member of a religious order has the following Basque variants: 
erlijioso, serora, lekaide (respectively translated as religious, nun, monk) 

Two of the Basque variants are “correct” in some way, but they are wrongly placed in the hierarchy. 

In fact, a program that searches for words that have two word senses, one being a hyperonym of the
other, found that there are 4500 such pairs out of 41107 word senses. This is a very high figure
compared to the English WordNet, and indicates that we need to check those word senses.
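
A search of this kind can be sketched as follows. This is not the program used in the project: the data structures, the synset identifiers and the choice to follow hyperonym links transitively (rather than only direct links) are assumptions made for illustration, and the toy data simply mirrors the religious / nun / monk example above.

```python
from collections import defaultdict

def hyperonym_closure(synset, hyperonyms):
    """All ancestors of a synset, following hyperonym links transitively."""
    seen = set()
    stack = list(hyperonyms.get(synset, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hyperonyms.get(current, ()))
    return seen

def nested_sense_pairs(variants, hyperonyms):
    """Find (word, specific synset, general synset) where the word is a variant of both."""
    synsets_of = defaultdict(set)
    for synset, words in variants.items():
        for word in words:
            synsets_of[word].add(synset)
    pairs = []
    for word, synsets in synsets_of.items():
        for synset in synsets:
            for ancestor in hyperonym_closure(synset, hyperonyms) & synsets:
                pairs.append((word, synset, ancestor))
    return sorted(pairs)

# Toy data; the synset names stand in for hypothetical identifiers.
variants = {
    "religious": {"erlijioso", "serora", "lekaide"},
    "nun": {"serora"},
    "monk": {"lekaide"},
}
hyperonyms = {"nun": {"religious"}, "monk": {"religious"}}
suspicious = nested_sense_pairs(variants, hyperonyms)
```

On the toy data the search flags serora and lekaide, the two variants that also appear in a more specific synset; such pairs are candidates for manual checking rather than automatic deletion.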

In other cases, the problem is more difficult to detect. For instance:

The concept superior glossed as the head of a religious community has the following Basque variants:
nagusi, buru (respectively translated as boss, head)

In this case, there is no direct equivalent for superior in Basque, and buru is correctly used to
lexicalize this concept. From a specificity point of view, we could say that buru is the Basque
hyperonym for superior, but from another point of view, it is a perfectly plausible translation
equivalent in Basque.

5. Conclusions of the Quantitative and Qualitative Analysis of BasqWN 0.1

A summary of the quality assessment for the nominal part of the Basque WordNet is the following:

• Coverage of concepts: 38% of the concepts in WordNet 1.5.

• Coverage of entries: 22,166 entries, which makes 25% of WordNet 1.5, but accounts for all
Euskal Hiztegia entries.

• Coverage of senses: estimated as 80% of the senses for the entries already in BasqWN 0.1.

• Coverage of synonyms: estimated as complete for the present concepts.

• Correctness of variants of a synset: estimated to be correct for all variants.

• Completeness of variants of a synset: estimated as all variants being present.





• Adequacy of the specificity level of the variants of a synset: we have some evidence that in some
instances one of the variants is at the wrong level of the hierarchy, usually too high.

• Correctness of word senses of a word: estimated as all senses being correct.

• Completeness of word senses of a word: estimated as 20% of the word senses being missing.

• Adequacy of the specificity level of word senses of a word (granularity): the specificity level of
BasqWN is much finer than that of the reference dictionary. This does not need to be a problem.

A number of other issues have also been observed: 

• We have around 3000 nominal concepts with dubious lexicalizations.

• Some words have a very high polysemy degree.

• We found incompatibility of part of speech for a number of word senses.

6. Word-to-Word Review 

Most of the shortcomings detected in the previous section can be overcome by an additional
review of the current BasqWN 0.1. In this review we want to ensure that the coverage of word senses
is more complete, trying to include the estimated 20% of word senses that are missing. In this case, the
review is to be done by studying each word in turn and paying attention to the following issues:

• Coverage of senses: add main word senses of basic words. 

• Correctness of word senses of a word: delete inadequate word sense when necessary. 

• Completeness of word senses of a word: add main word senses. 

• Adequacy of the specificity level of word senses of a word (granularity): check that granularity of 
word senses is balanced. 

The need to build a core WordNet led us to define a subset of the nominal entries to be covered: on
the one hand, the top 400 words from a frequency analysis, and on the other hand, the entries
in a basic bilingual Basque-Spanish dictionary (Elhuyar, 1998) which tries to define the core
vocabulary of Basque (13,000 nouns). The word senses are provided by the monolingual dictionary
(EH) and the bilingual dictionary. The bilingual dictionary includes modern words and word senses
which are not in EH.

7. Summary and Further Work 

This paper has presented a methodology that tries to integrate the best of development methods based
on the translation of the English WordNet and development methods based on a native dictionary. We
first developed a quick core WordNet comparable to the final EuroWordNet release, using
semi-automatic methods that include a concept-to-concept manual review, and later performed an
additional word-to-word review based on native lexical resources that guarantees the quality of the
WordNet produced.

We are currently extending the coverage of the noun entries and word senses to those in a basic
vocabulary of Basque. In the future we plan to apply the methodology to verbs and adjectives and to
extend the coverage to a comprehensive set of noun entries. In addition, we would like to check the
coverage for entries and word senses of nouns in a Cross-lingual Information Retrieval application. In
parallel, we are planning to map the word senses in a Basque monolingual dictionary to the Basque
WordNet, which would allow us to import lexico-semantic relations from the Basque dictionary (Agirre
& Lersundi, 2001) to EuroWordNet.

Acknowledgments 

This research was partially funded by the European Commission under the Feder program (project
2FD1997-1503), the Spanish Ministry of Science and Technology (project Hermes TIC2000-0335-C03-03),
the University of the Basque Country (UPV 141.226-G19/99) and the Provincial
Government of Gipuzkoa (project Berbasare OF 206/00). We would like to thank the research team of
the Technical University of Catalonia for sharing their algorithms and tools with us.

References 


Agirre, E. and Lersundi, M. (2001) Extracción de relaciones léxico-semánticas a partir de palabras
derivadas usando patrones de definición. In Proceedings of SEPLN 2001. Jaén (Spain).

Atserias, J., Climent, S., Farreras, J., Rigau, G. & Rodriguez, H. (1997) Combining Multiple Methods
for the Automatic Construction of Multilingual WordNets. In Proceedings of the Conference on Recent
Advances on NLP (RANLP’97). Tzigov Chark, Bulgaria.

Benitez, L., Escudero, G., Farreras, J. & Rigau, G. (1998) WWI: A Multilingual WordNet Interface
using the Web. Technical Report LSI-98-6-T. LSI Department, Universitat Politecnica de Catalunya.

Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. The MIT Press, Cambridge,
Massachusetts. London, England.

UZEI. (1987) Euskalterm. http://www.uzei.com/en/euskalter.htm (20 Sep. 2001).

Vossen, P., L. Bloksma, S. Climent, M. Antonia Marti, G. Oreggioni, G. Escudero, G. Rigau, H.
Rodriguez, A. Roventini, F. Bertagna, A. Alonge, C. Peters, W. Peters. (1998) The Restructured
Core wordnets in EuroWordNet: Subset 1. EuroWordNet (LE-4003) Deliverable D014/D015,
University of Amsterdam.

Vossen, P., L. Bloksma, S. Climent, M.A. Marti, M. Taule, J. Gonzalo, I. Chugur, M. F. Verdejo, G.
Escudero, G. Rigau, H. Rodriguez, A. Alonge, F. Bertagna, R. Marinelli, A. Roventini, L. Tarasi.
(1998) EuroWordNet Subset2 for Dutch, Spanish and Italian. EuroWordNet (LE-4003) Deliverable
D027/D028, University of Amsterdam.

Vossen, P., L. Bloksma, S. Climent, M.A. Marti, M. Taule, J. Gonzalo, I. Chugur, M. F. Verdejo, G.
Escudero, G. Rigau, H. Rodriguez, A. Alonge, F. Bertagna, R. Marinelli, A. Roventini, L. Tarasi, W.
Peters. (2001) Final Wordnets for Dutch, Spanish, Italian and English. EuroWordNet (LE2-4003)
Deliverable D032/D033, University of Amsterdam.

Dictionaries 

Aulestia, G. & White, L. (1990) English-Basque Dictionary. University of Nevada Press, Reno.

Elhuyar. (1998) Hiztegi txikia.

Morris, M. (1998) Morris Hiztegia.

OUP. (1994) The Oxford Spanish Dictionary. Oxford University Press.

Sarasola, I. (1996) Euskal Hiztegia.

UZEI. (1999) Sinonimoen Hiztegia.

Eneko Agirre, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Olatz Ansa, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Xabier Arregi, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Jose Mari Arriola, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Arantza Diaz de Ilarraza, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Eli Pociello, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.

Larraitz Uria, IXA NLP Group, University of the Basque Country, 649 pk. 20.080 - Donostia. Spain.





Expanding EWN with Domain-Specific Terminology Using Common 
Lexical Resources: Vocabulary Completeness and Coverage Issues 


Stamou Sofia, Ntoulas Alexandros
Kyriakopoulou Maria, Christodoulakis Dimitris


ABSTRACT


EuroTerm is a multilingual semantic network comprising domain-specific terminology for Greek, 
Dutch and Spanish, which will be linked to the EuroWordNet lexical database. Two approaches have 
been widely adopted for the development of WordNets, namely the merge and the expand model. The 
former is considered as the one that ensures a better representation of language particularities in a 
lexical database whereas the latter assures sufficient overlap in the coverage of WordNets. For the 
development of EuroTerm a combination of both models was followed in order to ensure vocabulary 
completeness and coverage across concepts. 

1. Introduction

EuroTerm¹ is an EC-funded project (EDC-2214) that aims at developing a multilingual
domain-specific lexical database, consisting of individual WordNets in three European 
languages (Greek, Dutch and Spanish), which will be incorporated into the EuroWordNet 
(EWN) semantic network. EWN (Vossen, 1996) is a lexical database representing semantic 
relations among basic linguistic concepts for eight West European languages, which are 
efficiently linked through the usage of an unstructured set of English concepts, namely the 
Inter-Lingual-Index (ILI). The main goal of EuroTerm is to expand EWN and the ILI with
domain-specific terminology by incorporating in it an “Environmental” domain and an
“Environmental Health” sub-domain. The domain-specific WordNets will be stored in a 
common lexical database, which will be linked to the central EWN database. Deviations from 
EWN might be due to different structure of the lexical resources and quality of the tools used 
for the terminology acquisition. Two different models have been used for developing 
WordNets, the merge and the expand model. The first one implies the independent 
development of each monolingual WordNet and then their linking to the most equivalent 
synset in the ILI (Vossen, 1996) whereas the expand model implies the translation of English 
concepts in other languages. For the development of EuroTerm, a combination of both models 
has been followed in order to assure compatibility and maximize control over the data across 
the individual WordNets while maintaining language-dependent differences. More 
specifically, the actual development of WordNets took place in two phases. During the first 
phase common English lexical resources were used for the terminology acquisition trying to 
ensure a common starting point for all languages in terms of quality and quantity of the 
resources. Consequently, the expand model was followed during the first phase of WordNets’ 
development. During the second phase the merge model has been adopted in the sense that 
monolingual lexical resources were used for the extraction of the environmental terminology 
to be incorporated in the monolingual WordNets. Since compatibility with the EWN database
is desirable, the Language Independent Module, the Top-Concept Ontology and the ILI of the
EWN are maintained. The only expansion concerns the incorporation of an “Environmental”
domain label to the already existing ones. In the following section (2) we briefly present
previous work conducted in this area and we continue with a description of our approach
towards terminology acquisition (3) with emphasis given on the expand model. Following on
from this, a discussion of the obtained results is provided (4) and we pinpoint some coverage
and completeness issues. In the remaining sections we present some applications of the
EuroTerm domain-specific network (5) and we conclude with an overall assessment (6) of our
approach.


1 EuroTerm EDC-2214, “Extending the EuroWordNet with Public Sector Terminology” funded by the EC. 



2. Related Work

A feasibility study on the incorporation of domain specific terminology into EWN has been 
conducted while considering at the same time other kinds of extensions of the ILI records. In 
order to test how EWN can be extended for a conceptual domain, the ILI was adapted for the 
domain of computing (Vossen, 1999), which can be accessed via the domain labeling. For 
achieving that, common English lexical resources were used while terminology in other 
languages relied, as much as possible, on individual resources containing translational 
equivalencies for the selected computer terms. The final set of the selected terms consists of 
444 terms most of which were assigned the computing domain label, and the remaining were 
spread under six sub-domains of computing (Vossen, 1999), which were then added under the 
EWN “Computer_Terminology” label. 

3. Applying the Expand Model for Terminology Acquisition: the selection phase 

The basic idea during the first phase of terminology acquisition is that terms to be added into 
the EWN database under the “Environmental” domain and “Environmental Health” sub- 
domain labels would be selected and verified from common English resources (both corpora 
and lexica) so that there is a common starting point in terms of quantity and quality. By 
starting off with a common set of terms we ensure that the core of the individual WordNets
is richly encoded and comparable, since they have the same conceptual coverage. The
corpora used for the terminology extraction consist of 429 environmental documents
comprising a total of 1,733,869 terms, and were manually collected from the following URLs:

• EEA, European Environment Agency (http://eea.eu.int) 

• EPA, United States Environmental Protection Agency (http://www.epa.gov) 

• Greenpeace (http://www.greenpeace.org) 

• NOAH, New York Online Access to Health (http://www.noah-health.org) 

• NRDC, Natural Resources Defense Council (http://www.nrdc.org) 

The need for domain-specific glossaries is more apparent, however, when dealing with large
amounts of relatively unstructured information, stored in various formats. Thus, various
English glossaries comprising 4,972 environmental terms in total with their glosses have been
collected in order to facilitate checking and verification of the quality and coverage of the
terminology extracted from the corpora. For the terminology extraction the selected
documents were converted to plain ASCII format, were POS tagged² in accordance with the
Penn Treebank POS tags (Santorini, 1990) and lemmatized in order to ensure that all
individual lemmas are detected even if they share common wordforms with each other. Once
the previous process was completed we obtained a list of terms found in the corpora followed
by their lexical category and lemma. The list was in the following format: <word> <tag>
<lemma>. The lemmatization process did not deal with single terms only, but with compounds as
well. As compounds, we considered two or more consecutive nouns and disregarded other
compound categories (e.g. noun-adjective) due to time constraints and due to the fact that the
environmental WordNet had to contain mainly nouns (~85% of the terms had to be nouns).
Lemmatized lists were processed and the occurrence frequency of each term was obtained in
the following form for single terms: <lemma> <lemmaCount> <lemmaFrequency> and for
compounds: <word1>, <word2>, <word3> <count> <compoundFrequency>. Examples of
the lemmatized and frequency lists are illustrated in Tables 1 and 2 respectively, whereas
examples of compound terms along with their frequency weights are illustrated in Table 3.
Extracted terms were sorted in terms of frequency, calculated on the basis of TF*IDF values.
TF*IDF metrics include a component based on the frequency of the term in a document (TF)


2 The POS-tagger and lemmatizer were provided by CentER Applied Research (Tilburg University)



and a component based on the inverse of the frequency within the document collection (IDF).
Usually, the TF*IDF values are applied on diverse document collections in order to measure
relevance of a term for a single document. However, we applied TF*IDF metrics for more than
one document collection since we dealt with a quite homogeneous collection in terms of
conceptual and vocabulary coverage. The measure of a term’s importance is based on
its frequency weight within the document. Lemmas with high IDF value tend to appear in
fewer documents but hold a more precise meaning, whereas lemmas with low IDF scores (e.g.
thing, state) tend to occur in most of the documents. Frequency lists, consisting of ~18,000
terms, were tokenized and stop words were eliminated. For the purpose of EuroTerm³, terms
having a POS tag other than noun (NN, NNS), verb (VB, VBD, VBG, VBN, VBZ) or
adjective (JJ, JJR, JJS) were considered as stop words and thus excluded from the list of
candidate terms.
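
The weighting can be illustrated with a minimal sketch. The exact formula used in the project is not given in the text, so the common variant below (relative term frequency multiplied by the logarithm of the inverse document frequency) is an assumption, as are the function name and the toy documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each a list of lemmas.
    Returns one {lemma: weight} dictionary per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each lemma once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            lemma: (count / total) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in counts.items()
        })
    return weights

# Toy collection: "thing" occurs in every document, "ozone" in only one.
docs = [["ozone", "thing", "ozone"], ["forest", "thing"], ["waste", "thing"]]
weights = tf_idf(docs)
```

A lemma such as thing, which occurs in every document, receives a weight of zero, whereas ozone, concentrated in a single document, receives a positive weight.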


<word>        <tag>    <lemma>
toxics        NNS      toxic
wastes        NNP      waste
cultivated    VBV      cultivate
forests       NNP      forest
ozone         NN       ozone


Table 1: Lemmatized list example


lemma     LemmaCount   LemmaFrequency   % lemmas⁴
water     6844         0.004286         0.4286%
waste     5679         0.003556         0.3556%
ozone     3613         0.002262         0.2262%
forest    716          0.000448         0.0448%
thing     81           0.000050         0.0050%


Table 2: Frequency list example


First Word   Second Word   Third Word   Count   Compound Frequency
lung         cancer        -            81      0.004036
solid        waste         -            60      0.005670
drinking     water         -            56      0.002790
carbon       dioxide       emissions    27      0.002151


Table 3: Example list of compounds with frequency weights attached
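
The lists in Tables 1 to 3 could be derived from tagged input along the following lines. This is a minimal sketch, not the project's tooling: it assumes one `<word> <tag> <lemma>` triple per line, takes the kept tag set from the list given in the text (reading the last verb tag as VBZ is an assumption), computes frequencies over the kept tokens only, and treats every maximal run of two or more consecutive nouns as one compound.

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS"}
KEPT_TAGS = NOUN_TAGS | {"VB", "VBD", "VBG", "VBN", "VBZ", "JJ", "JJR", "JJS"}

def parse(lines):
    """Turn '<word> <tag> <lemma>' lines into (word, tag, lemma) tuples."""
    return [tuple(line.split()) for line in lines if line.strip()]

def lemma_frequencies(tokens):
    """Map each kept lemma to (count, relative frequency among kept tokens)."""
    kept = [lemma for _, tag, lemma in tokens if tag in KEPT_TAGS]
    counts = Counter(kept)
    total = len(kept)
    return {lemma: (n, n / total) for lemma, n in counts.items()}

def noun_compounds(tokens):
    """Count maximal runs of two or more consecutive nouns, as word forms."""
    compounds = Counter()
    run = []
    for word, tag, _ in tokens + [("", "", "")]:  # sentinel flushes the last run
        if tag in NOUN_TAGS:
            run.append(word.lower())
        else:
            if len(run) >= 2:
                compounds[tuple(run)] += 1
            run = []
    return compounds

lines = [
    "carbon NN carbon",
    "dioxide NN dioxide",
    "emissions NNS emission",
    "are VBP be",
    "toxic JJ toxic",
]
tokens = parse(lines)
frequencies = lemma_frequencies(tokens)
compounds = noun_compounds(tokens)
```

In the toy input, the token tagged VBP is dropped as a stop word, and the three consecutive nouns are counted once as a single compound.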


One of the criteria applied for the selection of the terminology to be incorporated into the
EuroTerm database is the occurrence frequency of terms within the corpus. In addition, the
presence of a wordform in the ILI was examined in order to ensure that terms to be included
are not already present. In case an environmental term already existed in the ILI, an
environmental label was attached to it indicating that the underlying term is domain-specific.
Otherwise, if a term was present but without an environmental sense, then an environmental
gloss was attached to it, which was extracted from the English environmental glossaries
described above. Incorporation of new terms in the ILI was performed through a Terminology
Alignment System (TAS) developed within the framework of the project⁵. New terms were
added to the ILI through the TAS with term identifiers to the local WordNets. The final set of
candidates was translated in each of the three languages with the use of bilingual dictionaries.
Then, monolingual synsets were developed for each of the translated terms and were added
under the “Environmental” domain label of each WordNet. At the end of the first phase the
core monolingual WordNets for the three languages had been developed according to the
expand model. The expand model ensured a reasonable level of overlap across monolingual
WordNets but there was a risk that terminology might be biased by English lexicalizations,
since experiments have shown that there is a considerable variation in the way semantic
information for equivalent words is coded across languages. In order to overcome such
problems we decided to follow the merge model during the second phase of synset
development and use monolingual lexical resources for the expansion of the core WordNets,
thus achieving a higher degree of consistency. This model is followed in the ongoing work,
and some preliminary results⁶ show that the quality of the outcome is going to meet our
expectations. It should be noted, finally, that we did not follow the merge model from the very
beginning due to the hypothesis that specialized concepts limited to a specific domain tend to
share a single meaning, thus overcoming lexical ambiguity problems. In addition, due to time
considerations and due to the fact that domain-specific lexical resources were not widely
available for the participating languages, we decided to start with a common set of English
resources and then enrich the extracted terminology with concepts derived from monolingual
resources. An assessment of the results obtained during the first selection process is discussed
in the following section with emphasis given on vocabulary coverage and completeness.


3 Only nouns, verbs and adjectives will be incorporated into the EuroTerm domain-specific network.

4 The fourth column gives the percentage of the term frequencies and is provided for a better understanding of
the figures.
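
The three-way selection rule described earlier in this section (term already in the ILI with an environmental sense, term in the ILI without one, term absent from the ILI) can be summarized as a small decision function. The sketch is hypothetical: the enum, the function and the dictionary that models the ILI lookup are illustrative names and do not describe the TAS interface.

```python
from enum import Enum

class Action(Enum):
    LABEL = "attach the environmental domain label to the existing ILI record"
    GLOSS = "attach an environmental gloss taken from the glossaries"
    NEW = "add a new ILI record through the TAS"

def select_action(term, ili_has_env_sense):
    """ili_has_env_sense maps every term already in the ILI to True or False,
    depending on whether one of its senses is environmental (assumed model)."""
    if term not in ili_has_env_sense:
        return Action.NEW
    return Action.LABEL if ili_has_env_sense[term] else Action.GLOSS

# Illustrative lookup only; the entries are not taken from the actual ILI.
ili = {"ozone": True, "sink": False}
```

For example, with this toy lookup, ozone would only receive the domain label, sink would receive an environmental gloss, and a term missing from the lookup would be added as a new record.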

4. Discussion of the Obtained Results: Coverage and Completeness Issues

According to Hearst, surprisingly useful lexical information can be obtained by applying
simple analysis techniques on unrestricted texts (Hearst, 1998). However, the set of concepts
that need to be covered by a semantic network cannot be readily obtained from a (domain-
specific) corpus (Buitelaar, 2001). Thus, our basic hypothesis is that a domain-specific
WordNet should comprise corpus-representative terms in conjunction with terms found in
domain-specific lexica. During the first phase of the project, domain-specific sense
assignment was semi-automatically performed using a manually constructed environmental
corpus. In this section we report on the methodology followed for determining the
domain-specific relevance of terms extracted from the corpus.

Occurrence frequency is not by itself a sufficient indicator of a term’s importance in a 
document, since terms of high frequency tend to hold many senses. As reported by Krovetz 
and Croft (Krovetz, 1992), word senses have a skewed frequency distribution, and an anomalous 
frequency distribution can be useful for determining domain-specific senses of general 
vocabulary terms. In addition, they found that general vocabulary terms that also hold a 
domain-specific meaning appear to have low frequency but may also have high 
semantic ambiguity. In order to overcome the first problem and ensure that the extracted 
terminology would be representative of the domain of environment, we applied the standard 
TF*IDF values to measure the importance of terms, based on the hypothesis that weighting 
words in inverse proportion to their number of senses should give similar effectiveness to 
weighting based on inverse collection frequency (Krovetz, 1992). Moreover, in order to deal 
with semantic ambiguity, we made extensive use of environmental glossaries to check 
which of the extracted terms had an environmental sense. Checking terms against 


* TAS has been developed by CentER AR with overall responsibility of Dr. Jeroen Hoppenbrouwers 
6 Unfortunately, since results lacked a consistent state at the time of this contribution, they are not shown here. 





environmental glossaries was conducted semi-automatically and took place in two phases. 
During the first phase all terms extracted from the corpus were stemmed, 7 converted to 
lowercase and then automatically checked (string matching) against glossaries in order to 
detect which of them were already present in the glossaries. Terms present both in the corpus 
and the glossaries were considered as candidate terms. During the second phase 
terminologists manually checked the glosses of the candidate terms as given by the glossaries in 
order to decide which of them held an environmental sense. The candidate terms that also had 
an environmental sense were the ones that formed the core environmental WordNet. Some 
preliminary results show that most of the terms extracted from the corpus do hold an 
environmental sense even if they belong to the general vocabulary. This is partly explained by 
the uniformity of the conceptual domain of our corpus. On the other hand, lexical ambiguity 
was dealt with on the basis of the domain-specific glossaries against which terms were checked. As 
pointed out by Pirkola (Pirkola, 1998), terms of special dictionaries are often unambiguous 
and thus specialized semantic classes limited to a specific context are non-polysemous ones. 
By using domain-specific glossaries we ensured that all terms included in our WordNet 
would be related to the conceptual domain of environment. In addition, each term included in 
the ILI also had a gloss attached in order to ensure that the correct sense of the term was 
present. Another way of measuring the conceptual relatedness of a term to a domain could be 
through term collocations extracted from the corpus. However, we decided to use domain- 
specific glossaries instead of collocations in order to overcome lexical ambiguity problems 
and thus maximize conceptual coverage and completeness of the environmental domain. 
Through the proposed approach, vocabulary completeness and coverage can be assured, since 
a combination of term frequencies extracted from corpora and information from domain- 
specific lexica was used for selecting terminology representative of a conceptual domain. 
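
The automatic part of this selection (TF*IDF weighting, suffix stripping, lowercasing and string matching against a glossary) can be illustrated with a minimal sketch. The suffix list, the function names and the threshold below are assumptions made for illustration only; the manual check of glosses by terminologists in the second phase is not modelled.

```python
import math
from collections import Counter

# Hypothetical suffix list; only -s and -es are mentioned as examples in the text.
SUFFIXES = ("es", "s")


def strip_suffix(term):
    """Lowercase a term and strip one common suffix (a crude stand-in for stemming)."""
    term = term.lower()
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def tf_idf(documents):
    """Return {term: highest TF*IDF weight over all tokenized documents}."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = {}
    for doc in documents:
        for term, count in Counter(doc).items():
            weight = (count / len(doc)) * math.log(n_docs / doc_freq[term])
            scores[term] = max(scores.get(term, 0.0), weight)
    return scores


def candidate_terms(documents, glossary, threshold=0.0):
    """Keep terms that are weighted above the threshold and also occur in the glossary."""
    normalized_glossary = {strip_suffix(entry) for entry in glossary}
    stemmed_docs = [[strip_suffix(token) for token in doc] for doc in documents]
    scores = tf_idf(stemmed_docs)
    return sorted(
        term for term, score in scores.items()
        if score > threshold and term in normalized_glossary
    )
```

The terms returned by `candidate_terms` correspond to the candidate terms of the first phase; deciding which of them actually hold an environmental sense remains a manual step.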

5. NLP Application of EuroTerm 

EuroTerm can be used as a resource for semantic information in many NLP applications 
ranging from Information Retrieval (IR) to dictionary publishing, as a means to separate 
generic from domain-specific vocabulary. One envisaged application concerns the 
incorporation of the domain-specific network in IR systems in order to test its performance 
on domain-dependent text retrieval. Our objective is to use the EuroTerm database 
to index documents and queries not in terms of wordforms but in terms of their conceptual 
meaning. The main idea we adopt regarding conceptual indexing is a semantic 
representation of documents in the index and their mapping to the correct synsets of the 
environmental domain. Consequently, we aim at conceptual text retrieval as opposed to exact 
keyword matching, since a document might be relevant to a search request even if it does not 
use the same words as the query. EWN has shown a potential for IR but the lack of word- 
sense disambiguation still handicaps the development of concept-based text retrieval 
(Gilarranz, 1997). A possible explanation for the limited performance of EWN in IR tasks 
might be the differentiation of conceptual equivalencies across languages, which in some 
cases accounts for diverging mappings from local WordNets to the ILI concepts, meaning that 
conceptual equivalencies are sometimes linked to distinct ILI concepts reflecting different 
senses of the same word (Peters, 1998). However, in EuroTerm such problems are rarely 
faced, firstly due to the semantically restricted domain of concepts it contains and secondly 
due to the fact that the core individual WordNets are based on a common set of environmental 
terms extracted from common resources. Thus, the environmental network can be directly 
used in IR applications as a way of clustering semantically related concepts, resulting in better 
precision scores of the obtained results when it comes to domain-specific text retrieval. 


7 Stemming included only suffix-stripping, i.e. we simply eliminated common suffixes of terms (e.g. -s, -es). 





EuroTerm will not solve all problems related to IR, but at least an integrated 
WordNet that anchors specialized terminology in generic vocabulary opens wider possibilities 
to develop applications for non-expert users. 
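
The idea of indexing by concept rather than by wordform can be illustrated with a small sketch. The word-to-concept table and the concept identifiers below are invented for illustration and do not reflect the actual EuroTerm database or its interface; word-sense disambiguation is ignored, so every listed word is assumed to have exactly one environmental sense.

```python
from collections import defaultdict

# Hypothetical toy mapping from wordforms to concept (synset) identifiers.
WORD_TO_CONCEPT = {
    "rubbish": "c_waste",
    "waste": "c_waste",
    "garbage": "c_waste",
    "stream": "c_river",
    "river": "c_river",
}


def to_concepts(text):
    """Map the words of a text to the set of concepts they express."""
    return {WORD_TO_CONCEPT[w] for w in text.lower().split() if w in WORD_TO_CONCEPT}


def build_index(documents):
    """Build an inverted index from concept identifiers to document identifiers."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for concept in to_concepts(text):
            index[concept].add(doc_id)
    return index


def search(index, query):
    """Return the documents that share at least one concept with the query."""
    hits = set()
    for concept in to_concepts(query):
        hits |= index.get(concept, set())
    return sorted(hits)
```

With such an index, the query "river waste" retrieves a document that only mentions "garbage" and "stream", which exact keyword matching would miss.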


6. Conclusions and Future Plans 


We have presented a combination of the merge and expand models followed for enriching 
EWN with domain-specific terminology, achieving maximal overlap and compatibility across 
languages. Also, we discussed how vocabulary coverage and completeness can be assured 
when using common resources as a starting point for building semantic networks and how the 
latter can be incorporated in an IR environment. Future work includes evaluation of the two 
models in order to assess their contribution to domain-specific terminology acquisition. 

Acknowledgements 


We wish to thank all partners of the project, namely Center for Applied Research (Tilburg University) 
and Department of Software & Computing Systems (Alicante University) for their valuable 
contribution. We would also like to thank the reviewers for their comments and Dr. Piek Vossen and 
Prof. Christiane Fellbaum for their support. 


References 


Buitelaar P., Sacaleanu B. (2001) Ranking and Selecting Synsets by Domain Relevance. In “Proceedings 
of WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, 
NAACL 2001 Workshop, Carnegie Mellon University, Pittsburgh, 3-4 June 2001 

Gilarranz J., Gonzalo J., Verdejo F. (1997) An Approach to Conceptual Text Retrieval 
Using the EuroWordNet Multilingual Semantic Database. In “Working Notes of AAAI Spring 
Symposium of Cross-Language Text and Speech Retrieval”, Stanford, CA 


Hearst M. (1998) Automated Discovery of WordNet Relations. In “WordNet: An Electronic Lexical 
Database” pp. 131-151, ed. Christiane Fellbaum, MIT Press. 

Krovetz R., Croft B. (1992) Lexical Ambiguity and Information Retrieval. In “ACM 
Transactions on Information Systems”. Vol. 10(2) pp. 115-141. 

Peters W., Peters I., Vossen P., (1998) Automatic Sense Clustering in EuroWordNet. In “Proceedings 
of the LREC Conference” 

Pirkola A. (1998) The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross- 
Language Information Retrieval. In “Proceedings of the 21st Annual International ACM SIGIR 
Conference”, Melbourne, Australia, pp. 55-63. 

Santorini B. (1990) Part-Of-Speech Tagging Guidelines for the Penn Treebank Project. Technical 
Report MS-CIS-90-47, Department of Computer and Information Science, University of 
Pennsylvania. 


Vossen P. (1996) Right or Wrong: Combining Lexical Resources in the EuroWordNet Project. In 
“Proceedings of the Euralex Conference”, pp. 715-728. 

Vossen P., Bloksma L., Peters W., Kunze C., Wagner A., Pala K., Vider K., Bertagna F. (1999) 
Extending the Inter-Lingual-Index with new Concepts. Deliverable 2D010, EuroWordNet LE2-4003 



Sofia Stamou {stamou@cti.gr}, Alexandros Ntoulas {ntoulas@cti.gr}, Maria Kyriakopoulou 
{kyriakop@cti.gr} and Dimitris Christodoulakis {dxri@cti.gr} can be reached at the Databases 
Laboratory of the Computer Engineering & Informatics Department, Patras University, GR26500, 
Greece. 





Semi-Automatic Construction of Korean Noun Thesaurus 
by Utilizing Monolingual MRD and an Existing Thesaurus 

Juho Lee 
Koaunghi Un 
Key-Sun Choi 

Abstract 

Since a thesaurus is used as a knowledge resource in many natural language processing 
systems, it is very useful and necessary for high-quality systems, especially for 
dealing with semantics. In this paper, we introduce a semi-automatic method for the 
construction of a Korean noun thesaurus by utilizing a monolingual MRD and an existing 
thesaurus. 

1 Introduction 

Constructing a thesaurus 1 or wordnet manually takes too much time and effort. In this 
paper, we introduce a semi-automatic method for the construction of a Korean noun thesaurus 
with lexical mapping to each noun’s sense. In this method, an MRD (Hangeul Society, ed., 
1997) and an existing translated thesaurus (Ikehara, et al. 1997) are used. By assigning the 
semantic categories of the existing thesaurus to each sense 2 of the nouns in the MRD, we combine 
these two resources and produce an expanded lexicalized noun thesaurus. The semantic category 
is assigned first and manual correction is performed in post-processing, so the thesaurus is 
constructed with relatively high accuracy and little effort. 

2 Related Works 

There has been research on constructing a WordNet using an existing WordNet and an MRD. In the 
EuroWordNet project, a multilingual database for several European languages was built based on the 
English WordNet. For example, the Spanish WordNet was constructed using the English WordNet (Atserias, et 
al. 1997, Farreres, et al. 1998). They used a Spanish monolingual dictionary and bilingual dictionaries 
for the construction. They proposed three methods and combined them in order to get high 
coverage. For Korean, Moon (1996) used the hypernym information of a Korean dictionary and 
combined it with a Korean translation of the English WordNet. Manual pruning was done 
during the construction. There was another study on the construction of a Korean WordNet 
based on the English WordNet (Lee et al, 2000). They used a bilingual dictionary to link the 
senses of Korean nouns to the synsets of the English WordNet. They applied six heuristics to 
word sense disambiguation and combined the heuristics with a decision tree. 

This work differs from preceding works in three respects: 1) our Korean noun thesaurus 
is a lexical map including hierarchy and other relationships, not a WordNet; 2) since it is 
based on a Japanese thesaurus and a Korean MRD, we benefit from the similarity between 
the two languages; 3) we link lexical units, the elements of the thesaurus, with each sense of the nouns in 
the MRD. 

3 Main Method 

In order to select nouns to be included in a new thesaurus, we extracted nouns from a large- 
scale corpus. The main method consists of three stages. In the first stage, semantic categories 
are assigned to the highly frequent nouns using the existing thesaurus. We refer to this existing 
Japanese thesaurus as the “NTT thesaurus” in this paper. In the second and third stages, the 
descriptive statements of each sense in the dictionary and the hyperlink information of the MRD are 
used to extend the thesaurus. The details of each stage are explained below. 


1 “Thesauri are based on concepts and they show relationships among terms. Relationships commonly expressed in a thesaurus 
include hierarchy, equivalence (synonymy), and association or relatedness.” [...] “WordNet structures concepts and 
terms not as hierarchies but as a network or a web” (Gail Hodge, 2000:6-7) 

2 Because the terms “sense” and “signification” are treated differently by linguists, we clarify that “sense” indicates the 
diverse realizations of the “signification” corresponding to a lexical unit in this study. 

3.1 Selection of the Nouns 

It is difficult to deal with all nouns in the MRD 3 . This is why we selected from the corpus the highly frequent 
nouns, which will be linked to the semantic categories of the thesaurus. It is 
preferable that the thesaurus consist of nouns that appear frequently in the corpus, so that 
the thesaurus is widely applicable. In this paper, we selected highly frequent 
nouns from the KAIST POS-tagged corpus 4 . Then, information for the highly frequent nouns 
was extracted from a Korean dictionary (Hangeul Society, ed. 1997). The highly frequent 
nouns covered about 91.1% of all nouns in that corpus. The statistics of the highly frequent 
nouns are as follows: 25,368 unique nouns with a total of 69,242 senses; the average 
number of senses per noun is 2.73. The NTT thesaurus contains 2,710 hierarchical semantic categories. 
Our major task is to assign one of the 2,710 semantic categories to each of the 69,242 senses. 

3.2 Assignment of semantic categories 

In this stage, semantic categories are assigned to each sense of the highly frequent nouns with 
reference to the noun list of the NTT thesaurus. First of all, we translated the noun list of the NTT 
thesaurus into Korean using a Japanese-Korean MT system. Then, experts corrected the result 
of the automatic translation. Finally, we manually arbitrated the cases of abnormal assignment 
between the languages. In spite of this process, the translation causes many problems; the 
most difficult one is due to the difference between the concept division systems. For example, in 
the Japanese thesaurus, words concerning “going” or “sorting” have more branches than in 
Korean, and vice versa for word root. In addition, in the course of adapting a 
Japanese thesaurus to Korean, we also found many problems. In the NTT semantic 
category hierarchy, the number of the word furniture is <895>. This word has the hyponyms 
<desk 896>, <chair 897>... <fireplace 900>. For the Japanese, a fireplace is 
understood as a sort of furniture, while Koreans treat it as a part of the kitchen. 
These problems stem from differences in thinking and culture. 

Then we can assign semantic categories by matching the highly frequent nouns to the translated 
noun list of the NTT thesaurus. If we cannot find a noun in the NTT noun list, we follow the hypernym 
of the highly frequent noun and use that hypernym until the matching succeeds. The hypernym of 
a noun can be extracted automatically from the descriptive statements of the monolingual 
dictionary. The hypernym is taken to be the head noun of the first noun phrase in the descriptive 
statement that defines the noun. We used some lexical patterns to find hypernyms. 
However, this method ignores the sense of the noun itself: the candidate semantic categories 
are assigned not to the sense but to the noun. Moreover, there can be some errors in the Korean 
translation of the Japanese thesaurus. In post-processing, word sense disambiguation was 
done manually to assign proper semantic categories to each sense of the noun, and the translation 
errors were also removed. 
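
The matching step can be sketched as a walk up the hypernym chain until a noun of the translated NTT list is reached. The data structures and the function name below are assumptions made for illustration; the lexical patterns that extract hypernyms from definitions and the manual post-processing are not modelled.

```python
def assign_category(noun, translated_ntt, hypernym_of, max_steps=10):
    """Return (category, matched_noun) for the first match on the hypernym chain.

    translated_ntt: hypothetical dict mapping translated NTT nouns to category numbers.
    hypernym_of: hypothetical dict mapping a noun to the hypernym found in its definition.
    Returns None when no noun on the chain is found in the NTT noun list.
    """
    current = noun
    seen = set()
    for _ in range(max_steps):
        if current in translated_ntt:
            return translated_ntt[current], current
        seen.add(current)
        current = hypernym_of.get(current)
        if current is None or current in seen:  # chain ended or looped back
            return None
    return None
```

Because the walk operates on nouns rather than on senses, the category it returns is only a candidate, which is why manual word sense disambiguation is still needed afterwards.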

Two people independently performed the same post-processing. Their results were 
compared with each other, and only the identical part was selected for the final semantic 
category in order to achieve high accuracy. A third party examined the parts where the results 
differed and chose the proper ones. We assigned semantic categories to 29,637 senses for 19,663 
nouns in this stage. However, the applicability was a little lower than we expected 
because of the translation errors and the discrepancy between the highly frequent nouns and 
the noun list of the NTT thesaurus. The hypernym information was not enough to compensate 

3 Urimal Korean dictionary has about 300,000 nouns. 

4 It is composed of 10 million eojeols. 




for these defects. This led us to use additional information in the next stage to achieve higher 
applicability. 

3.3 Use of the definition in MRD 

In this stage, we use the result of the previous stage and the definitions of the dictionary to 
expand the preliminary thesaurus built in the preceding section. 

3.3.1 Approach 

We used an information retrieval technique on the assumption that senses in 
the same semantic category are defined by similar words in the dictionary (Chen 1998). For 
the senses to which we had assigned semantic categories in the previous stage, we clustered 
the definitions of the senses by semantic category, making one cluster of definitions per semantic 
category. Each cluster corresponds to a document in IR and the definition of a 
sense corresponds to a query. Assigning proper semantic categories to a sense can 
be viewed as retrieving relevant documents for the query. Since we have already assigned semantic 
categories to part of the senses in the previous stage, we can assign semantic categories to 
the rest of the nouns by this approach based on the previous results. 


3.3.2 Algorithm 


We must compute the similarity between the definition and each cluster to retrieve the relevant 
clusters for that definition. We simply used tf (term frequency) and idf (inverse document 
frequency) to compute the similarity and gave an extra weight to hypernyms. The similarity 
between the definition Q and the cluster C_i was computed by this equation. 

$$ \mathrm{sim}(Q, C_i) = \sum_{t_j \in Q} \bigl(1 + g(t_j)\bigr) \cdot tf_{ij} \cdot \log\frac{N}{df_j} $$

t_j: content words in the definition Q 
g(t_j): function that gives a weight for hypernyms; g(t_j) = w if t_j is a hypernym (w = 2 is used), 0 otherwise 
tf_ij: the frequency of word t_j in cluster C_i 
N: the number of clusters 
df_j: the number of clusters in which word t_j appears 

[Figure 1: number of senses whose relevant cluster appears within the top n ranks; horizontal axis: Rank (n)] 

We computed the similarity between the sense and each cluster and found the most relevant cluster. 
The semantic categories of that cluster are then the candidate categories for the sense. 
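
A minimal sketch of this ranking step is given below. It assumes that the hypernym weight enters the tf-idf sum as a factor (1 + g), so that words that are not hypernyms still contribute; this combination, the function name and the data layout are assumptions made for illustration.

```python
import math
from collections import Counter


def rank_clusters(definition, hypernyms, clusters, w=2.0):
    """Rank semantic categories by similarity between a definition and each cluster.

    definition: list of content words of the sense definition (the query Q).
    hypernyms: set of words in the definition that were identified as hypernyms.
    clusters: dict mapping a semantic category to the list of words of its cluster.
    """
    n_clusters = len(clusters)
    tf = {category: Counter(words) for category, words in clusters.items()}
    df = Counter()  # number of clusters in which each word appears
    for counts in tf.values():
        df.update(counts.keys())

    scores = {}
    for category, counts in tf.items():
        score = 0.0
        for term in definition:
            if df[term] == 0:  # word unseen in every cluster
                continue
            g = w if term in hypernyms else 0.0
            score += (1.0 + g) * counts[term] * math.log(n_clusters / df[term])
        scores[category] = score
    return sorted(scores, key=lambda category: scores[category], reverse=True)
```

The first few categories of the returned ranking can be offered to the annotators as candidates for the sense.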


3.3.3 Experiment 

We clustered the definitions to which a semantic category had been assigned in the previous 
stage by semantic category. We randomly chose 150 senses that were not yet involved in the 
assignment of semantic categories as experiment data. Figure 1 shows the number 
of senses whose relevant cluster appeared within the top n ranks of the result. As n increases, 
the number of senses converges. When we gave 10 candidates per sense, the recall was 
61.3%. This result shows that this second stage is useful for extending the thesaurus. 
Post-processing is also required in this stage, by the same method as in 3.2. In this stage, we added 
19,263 senses for 9,006 nouns to the thesaurus. The average number of senses per noun 
assigned in this stage is bigger than that of the first stage. In other words, the method of 
the second stage is more useful in dealing with polysemy. 
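
The recall figure can be read as the share of test senses whose correct category appears among the first n candidates. A minimal sketch of that measure follows; the function name and the data layout are assumptions made for illustration.

```python
def recall_at_n(rankings, gold, n):
    """Fraction of senses whose gold category is among the top n ranked candidates.

    rankings: dict mapping a sense to its ranked list of candidate categories.
    gold: dict mapping a sense to its correct category.
    """
    hits = sum(1 for sense, ranked in rankings.items() if gold[sense] in ranked[:n])
    return hits / len(rankings)
```

Under this reading, a recall of 61.3% at n = 10 on 150 test senses corresponds to 92 senses.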

3.4 Using the hyperlink information 






In this stage, we use the hyperlink information to extend the thesaurus. Our structured version 
of the Korean dictionary has hyperlink information such as synonym, abbreviation, antonym, etc. 
It is reasonable to assume that two senses linked by this hyperlink information, except for 
antonym links, belong to the same semantic category. So we expand the thesaurus using this 
property, based on the previously accumulated results and the hyperlinks. Post-processing is not 
necessary in this stage. In this stage, we added 7,623 senses for 4,658 nouns to the thesaurus. 
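
This propagation can be sketched as follows. Treating links as symmetric and following them transitively are assumptions of the sketch, as are the function name and the data layout.

```python
from collections import deque


def propagate(assigned, links, excluded=frozenset({"antonym"})):
    """Copy semantic categories along dictionary hyperlinks, skipping excluded link types.

    assigned: dict mapping a sense to its already assigned semantic category.
    links: iterable of (sense_a, link_type, sense_b) triples taken from the dictionary.
    Returns a new dict that also covers the senses reached through the links.
    """
    neighbours = {}
    for sense_a, link_type, sense_b in links:
        if link_type in excluded:
            continue
        neighbours.setdefault(sense_a, set()).add(sense_b)
        neighbours.setdefault(sense_b, set()).add(sense_a)

    result = dict(assigned)
    queue = deque(assigned)
    while queue:
        sense = queue.popleft()
        for other in neighbours.get(sense, ()):
            if other not in result:  # never overwrite an existing assignment
                result[other] = result[sense]
                queue.append(other)
    return result
```

Senses that are connected to the existing thesaurus only through antonym links stay unassigned.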

4 The Integrated Browsing Tool 

Manual post-processing is an important part of this construction. We made an integrated 
browsing tool for the dictionary and the thesaurus. Figure 2 shows this browsing tool. The tool 
has a WWW interface and is divided into four frames: input, thesaurus browsing, MRD 
browsing and semantic category browsing. The four frames are functionally dependent on each 
other. 

5 Conclusion 

In this paper, we introduced a semi-automatic method for the construction of a Korean noun 
thesaurus. This method uses an MRD and an existing thesaurus. The method consists of three 
stages. In the first stage, semantic categories are assigned to the highly frequent nouns using 
an existing thesaurus. In the second and third stages, the definitions of the dictionary and the 
hyperlink information are used to expand the thesaurus. We constructed a Korean noun thesaurus, 
which has 56,523 senses for 23,823 nouns and 2,710 hierarchical semantic categories. 

Various applications to natural language processing systems are necessary for evaluation 
and feedback. It is also necessary to correct errors gradually and to supplement this result with 
new nouns and senses as well as with other parts of speech. 

References 

Atserias, Jordi, et al. 1997. Combining Multiple Methods for the Automatic Construction of Multilingual 
WordNets. In Proc. of International Conference on Recent Advances in NLP. 

Chen, Jen Nan and Jason S. Chang. 1998. Topical Clustering of MRD Senses Based on Information 
Retrieval Techniques. Computational Linguistics Volume 24, Number 1. 

EuroWordNet home page. http://www.hum.uva.nl/~ewn 

Farreres, Xavier, et al. 1998. Using WordNet for building WordNets. In Proc. of COLING-ACL 
Workshop on Usage of WordNet in Natural Language Processing Systems. 

Hangeul Society, ed. 1997. Urimal Korean Unabridged Dictionary, Eomungag. 

Ikehara, Satoru, et al. 1997. The Semantic System, volume 1 of Goi-Taikei - A Japanese Lexicon. 
Iwanami Shoten. 

Lee, Changki and Geunbae Lee. 1999. Using WordNet for the Automatic Construction of Korean 
Thesaurus. In Proc. of 11th Hangeul & Korean Information Processing, pp. 156-163 (in Korean). 

Lee, Changki, et al. 2000. Automatic WordNet mapping using word sense disambiguation. In Proc. of 
Joint SIGDAT Conference on EMNLP/VLC. 

Miller, George A. 1990. WordNet: An on-line lexical database. International Journal of Lexicography. 

Moon, Yoo-jin. 1996. Design and Implementation of WordNet for Korean Nouns. Journal of KISS (c) 
2/4, pp. 437-445 (in Korean). 

Korterm, Computer Science, KAIST, 373-1 Guseong Yuseong Daejeon 305-701 Korea (S.) 
Juho Lee, Koaunghi Un, Key-Sun Choi {mywork, koaunghi, kschoi}@world.kaist.ac.kr 








A Tree-structure Solution for the Development of ChineseNet 

Liu Yang 
Yu Jiangsheng 
Yu Shiwen 

Abstract 

In this paper, we would like to put forth the notion of tree-structure in the 
development of a WordNet-compatible concept dictionary. After getting the full-hyponymy 
information in WordNet successfully, we have further implemented a visual tree-structure 
control which enables the lexicographers to operate interactively on the view of the 
hyponymy tree, with correspondingly automatic modifications of the database in the 
background. The expression of semantics during development thus becomes much more 
intuitive and efficient. ICL (the Institute of Computational Linguistics) has 
benefited a lot from employing this new solution for the development of CCD (the Chinese 
Concept Dictionary), our ChineseNet, here in Peking University. 

Keywords Tree, Algorithm, Chinese, WordNet, CCD 

1 Introduction 

Nowadays NLP in Chinese is more focused on the processing of content information, 
such as Information Retrieval, Automatic Abstraction, Literature Classification and others, 
which needs a WordNet-compatible dictionary of Chinese concepts as the knowledge-base. 
On the other hand, the Chinese Concept Dictionary (CCD) is indispensable for Word 
Sense Disambiguation (WSD) in the fields of NLU and MT. 

The Institute of Computational Linguistics (ICL), Peking University, with this point of 
view, has launched its ChineseNet project. The expected CCD might be described as 
follows: it should carry the main relations already defined in WordNet, with more or less 
updating to reflect the practice of Contemporary Chinese, and it should be a bilingual one 
in which parallel Chinese-English concept pairs are presented simultaneously. Such 
a WordNet-compatible dictionary of Chinese concepts can largely meet the needs of our 
applications. 

Thus comes the radical issue on how to build the dictionary, or, in other words, on what 
is the proper solution for it. 

Answers may lie in the inherent structure of WordNet. As the relations that link synsets 
in the dictionary actually interweave into a huge lexical semantic net with, say, tens of thousands 
of nodes, what really counts in the development of such a WordNet-compatible dictionary is 
how to set up the relations properly and how to maintain the semantic consistencies in case of 
frequent modifications. After analysis, we have come to believe that the 
difficulties of the development of the dictionary mainly result from these. 

Roughly, there are three potential choices for the solution: a 
linear-structured one, a tree-structured one or a net-structured one. Let’s begin with the two 
ends. Naturally, a linear structure can hardly cope with the difficulty of carrying tremendous 
pieces of net-structured information during the period of development. Developing the 
dictionary directly on a large-scale, visual, net-structure control instead of merely the linear 
database might shed some light on this issue. However, due to the same problem of 
inherent complexity, of time and of storage, such a net-structure control has not appeared, 
and, perhaps, may never appear. 

In this paper, we would like to put forth the notion of tree structure in the development of 
the dictionary. 

Later on, we intend to illustrate the solution with emphasis on the basic 
ideas, though some of the algorithms will inevitably be touched on. Actual cases from the 
development practice at Peking University and the prospects of CCD will also be covered. 

2 Getting the Full-hyponymy Information in WordNet 

Obviously, the relation of hyponymy in WordNet is the most important relation among 
others. 

Since WordNet means to describe a kind of syntagmatic relations in lexicon, hyponymy, 
we may say, serves as the main frame to grasp the other relations such as antonymy, 
meronymy, troponymy, etc. During the period of development, only when synsets (to act as 
the basic units which also highly rely on the ontology of full-hyponymy) and hyponymy (to 
act as the basic relation) are well defined and realized, can the other relations be appropriately 
added into the lexicon. Likewise, during the period of utilization, the full-hyponymy 
information is totally valuable for the higher applications and the browser users. 

However, to extract the full hyponyms for a certain synset is by no means easy. As we 
have examined, the number of hyponyms for a synset ranges from 0 to 499 with a maximal 
hyponymy depth of 15 levels. This shows that the shape of the potential full-hyponymy tree is 
quite unbalanced. Because of this, the ordinary searching algorithm can hardly cope with the 
complexity of time and storage. If one inputs the word entity as an entry in WordNet 1.6 
and tries to search its full hyponyms, one will get nothing but a note of “Search too large. 
Narrow search and try again.” unless one narrows the search by terminating 
it beforehand. Sure enough, if the entry is not entity but another word, say cat, the search 
will probably succeed. The outcome actually depends on the location of the entry word in the 
potential full-hyponymy tree in WordNet. The higher the level, the less likely the search 
is to succeed. 

Before we go on, we need to introduce the notion of position, which plays a pivotal role 
in the tree-structure solution. A position is the location of a certain node in the tree and 
it serves to organize the tree. For example, the position value “005001002” 
represents the following location of a node in a tree: at the 1st level, its ancestor is the 5th node; 
at the 2nd level, its ancestor is the 1st node; and at the 3rd level, the node itself 
is the 2nd. In fact, such an encoding takes the appearance of a linear string while 
expressing the full information of a tree structure. This special kind of encoding makes all 
the tree-structure algorithms feasible. 
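
The position encoding can be illustrated with a few helper functions. The fixed width of three digits per level is inferred from the example “005001002”; the function names are assumptions made for illustration.

```python
WIDTH = 3  # digits per level, as in the example "005001002"


def encode(path):
    """Turn a path of 1-based sibling indices, e.g. [5, 1, 2], into a position string."""
    return "".join(f"{index:0{WIDTH}d}" for index in path)


def decode(position):
    """Turn a position string back into the list of sibling indices per level."""
    return [int(position[i:i + WIDTH]) for i in range(0, len(position), WIDTH)]


def level(position):
    """Depth of the node in the tree (1 for a top-level node)."""
    return len(position) // WIDTH


def parent(position):
    """Position of the parent node (empty string for a top-level node)."""
    return position[:-WIDTH]


def is_descendant(position, ancestor):
    """True if the node lies strictly below the given ancestor."""
    return len(position) > len(ancestor) and position.startswith(ancestor)
```

Since ancestry reduces to a string-prefix test, a subtree can be selected from a flat list of records without following any pointers.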





Now let us demonstrate the searching algorithm for getting the full-hyponymy 
information in WordNet. By and large, it involves a series of two-way scanning processes 
and gathering processes, with each round of the series intended to get the 
information of the nodes on one level of the tree. 

Suppose that, in the I-th round of the process series, we have got the synsets L_I1, L_I2, ..., L_In, 
with the positions “X001”, “X002”, ... , “X00n” respectively. This implies that on the I-th 
level of the tree there are n nodes with the locations “X001”, “X002”, ... , “X00n” 
respectively. Then we want the (I+1)-th round of the process series to get 
the information of the nodes on the (I+1)-th level of the tree. We have these synsets ordered 
by their offsets before we do the two-way scanning process. In the scanning process, 
two pointers are set, one for the array of the sorted synsets, the other for the 
corresponding DAT file to be compared with. During the comparison, new positions on the 
(I+1)-th level are continuously generated according to the definition of position encoding. It 
is easy to prove that such a task can be done in O(N) time and O(n+n') storage, 
where N is the number of records in the DAT file and n' is the 
number of synsets on the (I+1)-th level of the tree. After the two-way scanning, the n' synsets 
on the (I+1)-th level of the tree together with their respective positions are obtained. In the 
gathering process, those synsets on the (I+1)-th level that satisfy the leaf-node condition are 
gathered, while the others are put into the next round of the process series. 

These process series are to be carried on repetitively till no new positions are generated 
on one round, which just means that the full-hyponymy information in WordNet has already 
been completely achieved. 
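
The round-by-round procedure can be sketched as follows. The sketch replaces the DAT file with an in-memory list of records sorted by offset, assumes that hyponymy contains no cycles, and keeps leaf nodes in the frontier instead of gathering them separately; the function name and the record layout are assumptions made for illustration.

```python
def full_hyponymy(records, root_offset, width=3):
    """Collect the full hyponymy tree of a synset level by level.

    records: list of (offset, [hyponym offsets]) sorted by offset, a hypothetical
             in-memory stand-in for the sequentially scanned DAT file.
    Returns a dict mapping position strings to synset offsets.
    """
    tree = {}
    frontier = [(root_offset, "1".zfill(width))]  # (offset, position) pairs of one level
    while frontier:
        for offset, position in frontier:
            tree[position] = offset
        frontier.sort()  # order the current level by offset before scanning
        next_frontier = []
        i = 0  # pointer into the sorted frontier; the for loop is the file pointer
        for offset, hyponyms in records:
            while i < len(frontier) and frontier[i][0] < offset:
                i += 1
            while i < len(frontier) and frontier[i][0] == offset:
                parent_position = frontier[i][1]
                for k, child in enumerate(hyponyms, start=1):
                    next_frontier.append((child, parent_position + str(k).zfill(width)))
                i += 1
            if i == len(frontier):  # every synset of this level has been handled
                break
        frontier = next_frontier
    return tree
```

Each pass of the outer loop corresponds to one round, so the number of passes equals the depth of the resulting tree. A synset with several hypernyms appears once under each of them, each time with its own position.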

Also, it can be inferred that the number of rounds equals the depth of the 
full-hyponymy tree for a specific entry word, say 15 for entity, or 11 for food. Now we have 
got the full-hyponymy information in WordNet, and, if one wants to view the tree, these 
pieces of information are ready for the tree-structured control through an operation of 
creating a tree. 

By this special algorithm, the complexity of searching is greatly reduced. In our lab, 
looking up the top entry word entity on an ordinary PC means 100 or so seconds of waiting 
time, with all 45,148 hyponyms generated. As for an ordinary entry word, say food with 
a total of 2,308 hyponyms, the algorithm is effectively real-time. 

3 The Tree-structure and the Operations on It 

Following that, we have designed a set of algorithms based on the existing Treeview 
Control in Microsoft Visual Studio 6.0 and eventually implemented a new data-sensitive 
tree-structure control with 9 visualized operations on it. 

Regarding the design of the algorithms for the operations, it is crucial that two kinds of
consistency be maintained. One is that of the structural
information of the foreground tree and the other is that of the semantic information of the
background database. As these algorithms are too intricate to be presented here, in an
introductory paper, we just list the names of the 9 visualized operations below.

0. to create a tree from a file;
1. to add a new brother node;
2. to add a new child node;
3. to delete the current node (for one);
4. to delete the current node (for all);
5. to cut the current node;
6. to copy the current node;
7. to paste as brother nodes;
8. to paste as children nodes.

Among these operations, No. 0 creates a new tree from external storage, while the rest
all edit the tree: No. 1 and 2 for addition, No. 3 and 4 for deletion, and No. 5, 6, 7
and 8 for batch movement. These operations have been carefully chosen to be concise,
capable and semantically meaningful.

It is easy to prove that a tree of any shape can be attained by iterative application of
these operations.
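
As a rough illustration of how such editing operations compose, the following Python sketch models a tree node with operations 1, 2 and 5 to 8. It is only a sketch under assumptions: the class `Node`, its method names and the helper `outline` are hypothetical, the synchronization with the background database is left out, and the two deletion operations are omitted because their exact semantics are not spelled out here.

```python
from typing import List, Optional


class Node:
    def __init__(self, label: str):
        self.label = label
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def new_child(self, label: str) -> "Node":            # operation 2
        child = Node(label)
        child.parent = self
        self.children.append(child)
        return child

    def new_brother(self, label: str) -> "Node":          # operation 1
        if self.parent is None:
            raise ValueError("the root has no brothers")
        brother = Node(label)
        self.paste_as_brother(brother)
        return brother

    def cut(self) -> "Node":                              # operation 5
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self

    def copy(self) -> "Node":                             # operation 6
        clone = Node(self.label)
        for child in self.children:
            clone.paste_as_child(child.copy())
        return clone

    def paste_as_brother(self, subtree: "Node") -> None:  # operation 7
        if self.parent is None:
            raise ValueError("the root has no brothers")
        index = self.parent.children.index(self)
        subtree.parent = self.parent
        self.parent.children.insert(index + 1, subtree)

    def paste_as_child(self, subtree: "Node") -> None:    # operation 8
        subtree.parent = self
        self.children.append(subtree)


def outline(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


food = Node("food")
fare = food.new_child("fare")
fare.new_child("diet")
ration = fare.new_brother("ration")
ration.paste_as_child(fare.cut())  # move the whole fare subtree under ration
```

Cut followed by paste moves a whole subtree in one step, which is what makes the batch-movement operations useful for reshaping a hyponymy tree.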

4 The Development of ChineseNet in Peking University 

By now, we have obtained the full-hyponymy information in WordNet through
the special searching algorithm, and it is just these pieces of information that will serve as
the database for our future use. Also, we have a visual, data-sensitive, tree-structure
control with the above well-defined operations on it. Then we will go on to organize the
full-hyponymy information, plus some other information relevant to the synsets, into a
hyponymy tree.

The lexicographers can now operate on the tree freely to express their intended lexical
semantics, with corresponding automatic modifications of the database in the background.
It is significant that the lexicographers no longer need to care about the many details of
the background database as they used to, for the foreground operations by themselves
already carry all the tree-structured lexical semantics and, with further work, can deal with
the net-structured lexical semantics.

Such is the outline of the more common solution for the development of a 
WordNet-compatible dictionary. Generally speaking, it has provided an easy approach to 
the evolution of the dictionary. 

However, when it comes to the development of ChineseNet, the case is a little more
complicated. As we have mentioned in the introduction section, we wish that our CCD not
only reflect the practice of Contemporary Chinese but also be a bilingual dictionary. So,
on the one hand, the development by and large follows the above-mentioned common
solution, with semantic operations to be done; on the other hand, along with each English
synset, the corresponding Chinese synset information is also to be presented.
To cope with the former problem, we have offered the 9 visualized operations. To cope with
the latter problem, we further add datafields for the peer-to-peer Chinese items to
the background database, and also add Editbox Controls recording the values of the Chinese
items to the data-sensitive tree in the foreground.

Thus a tool for the development of the bilingual dictionary CCD has come out as below. 
The interface view is showing the full-hyponymy tree for the entry word food, which is one 
of the 25 initial semantic units of nouns in WordNet with the category value of 13. 



[Figure: screenshot of the Generator and Browser [CCD], showing the full-hyponymy tree for
food (2308 hyponyms) with nodes such as [food] [nutrient], [fare], [diet], [menu], [ration],
[foodstuff] [food_product] and [beverage] [drink] [drinkable] [potable], and the gloss
"any substance that can be used as food".]


5 Conclusions and Future Work 

Peking University launched the ChineseNet project in September 2000, and by
now we have completed 10,000 or so Chinese-English concept pairs. Owing to the
visualization and interaction features of the tree-structure solution, we have benefited a
lot from employing it in the development work. What is more, as byproducts of these
methods and experiences, we have even found some faults of semantic expression in
WordNet 1.6, such as many occurrences of nodes with multiple fathers in the same category,
improper locations of relational pointers in DAT files and others.

In the long run, ICL aims to reach a total of 60,000 bilingual concept pairs,
which might largely meet the needs of our applications.


Acknowledgement 

This work is a component of research into Chinese Information Extraction funded by
No. 69483003, National Foundation of Natural Science and Project 985 in Peking University.

We are especially grateful to Prof. Wang Fengxin and Prof. Lu Chuan for unforgettable
discussion and support. Thanks also to all the fellows who have participated in and
collaborated on the work, among whom we would like to mention Mr. Zhang Huarui, Ms.
Song Chunyan, Mr. Liu Dong, Ms. Wang Wei, and others.

References 

Beckwith, R., Miller, G. A. and Tengi, R. 1993. Design and Implementation of the 
WordNet Lexical Database and Searching Software. 

Cook, G. and Barbara, S. 1995. Principles & Practice in Applied Linguistics. Oxford: 
Oxford University Press. 

Cruse, D. Alan. 1986. Lexical Semantics. Cambridge and New York: Cambridge University 
Press. 

Fellbaum, C. 1993. English Verbs as a Semantic Net. 

Fellbaum, C., Gross, D. and Miller, K. 1993. Adjectives in WordNet. 

Fellbaum, C. 1999. WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT 
Press. 

Huang, C.R. et al. 2001. Linguistic Tests for Chinese Lexical Semantic Relations: 
Methodology and Implications. Report in the Second Workshop on Chinese Lexical 
Semantics, Beijing, 2001. 

Keil, F. C. 1979. Semantic and Conceptual Development: An Ontological Perspective. 
Cambridge, Mass.: Harvard University Press. 

Lyons, John. 1977. Semantics, 2 vols. London and New York: Cambridge University Press.

Miller, G. A. 1993. Nouns in WordNet: A Lexical Inheritance System.

Miller, G. A. et al. 1993. Introduction to WordNet: An On-line Lexical Database. 

Touretzky, D. S. 1986. The Mathematics of Inheritance Systems. Los Altos, Calif.: Morgan 
Kaufmann 

Vossen, P. (ed). 1998. EuroWordNet: A Multilingual Database with Lexical Semantic 
Networks. Dordrecht: Kluwer. 

Liu Yang 

Email: liuyang@pku.edu.cn 
Yu Jiangsheng 
Email: yujs@pku.edu.cn 
Yu Shiwen 

Email: yusw@pku.edu.cn 

Institute of Computational Linguistics, Dept, of CS 
Peking University 
Beijing 100871, P. R. China 





Experiences in building the Indo WordNet — 
A WordNet for Hindi 


Debasri Chakrabarti 
Dipak Kumar Narayan 
Prabhakar Pandey 
Pushpak Bhattacharyya 

Abstract 

It is increasingly being understood by practitioners of information
retrieval, natural language processing and knowledge engineering
that a rich lexical knowledge base is the heart of any intelligent
information processing system. The words of a language are extremely
powerful units that bind together in extensive and unique ways to
create a knowledge web. Indo WordNet, as we call the WordNet for
Hindi, is an on-line lexical database. It is an attempt to build a lexical
reference system for the Hindi language. The design has been inspired by
the famous English WordNet. For each word we find the synonym set,
representing one lexical concept. These synonym sets are linked with
other synonym sets through the well-known semantic relationships of
hypernymy, hyponymy, meronymy, holonymy, antonymy and so on.

The Indo WordNet has some unique features like graded antonymy
and meronymy. It also addresses uniquely Indian language phenomena
such as causative, compound and conjunct verbs, both at the
conceptual level and in the implementation. It has an efficient underlying
database design. The web interface for querying the Indo WordNet
has been implemented using the PHP4 scripting language. The data entry
interface, implemented using Java/JFC, is also simple and elegant.

Keywords: WordNet, Semantic Relations, Hindi language, Lexical database 


1 Introduction 

The Indo WordNet is a system for bringing together different lexical
relationships between Hindi words. It organizes the lexical information in
terms of word meanings. It can be said to be a dictionary based on
psycholinguistic principles [Fel98]. In any WordNet each word is organized into
synonym sets or synsets. This is done to remove ambiguity in cases where a
single word has multiple meanings. Thus, the synset serves as an identifying
definition of lexical concepts. For example, the two synsets
{house, room} and {house, home} can serve as unambiguous
differentiators of the two meanings of {house} [Var98] [KK97].



The WordNet is actually a word sense network. A word sense node in
this network is a synset which is regarded as a basic object in the WordNet.
All word sense nodes are linked by various semantic relations, such as
synonymy, IS-A (hypernymy/hyponymy), PART-OF (meronymy/holonymy)
etc. These relations serve to organize the lexical knowledge base.
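
The idea of a word sense network can be sketched in a few lines of Python. This is only an illustration under assumptions, not the Indo WordNet implementation: the class `SenseNetwork`, the relation names and the integer synset ids are hypothetical, and the English glosses stand in for the Hindi words.

```python
from collections import defaultdict

# Relation names are hypothetical; each one is stored together with its inverse.
INVERSE = {
    "is_a": "has_kind", "has_kind": "is_a",
    "part_of": "has_part", "has_part": "part_of",
}


class SenseNetwork:
    def __init__(self):
        self.synsets = {}                # synset id -> tuple of words
        self.senses = defaultdict(list)  # word -> ids of the synsets it is in
        self.links = defaultdict(set)    # (synset id, relation) -> synset ids

    def add_synset(self, sid, words):
        self.synsets[sid] = tuple(words)
        for word in words:
            self.senses[word].append(sid)

    def relate(self, source, relation, target):
        # Every link is recorded in both directions under inverse names.
        self.links[(source, relation)].add(target)
        self.links[(target, INVERSE[relation])].add(source)


net = SenseNetwork()
net.add_synset(1, ["house", "room"])
net.add_synset(2, ["house", "home"])
net.add_synset(3, ["head"])
net.add_synset(4, ["body"])
net.relate(3, "part_of", 4)
```

Looking up `net.senses["house"]` returns both synsets, so the synset id rather than the word identifies the intended meaning.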

2 Semantic Relations 

We have implemented all the common semantic relations like synonymy,
hypernymy, hyponymy, meronymy, holonymy, troponymy, entailment etc.
[Fel98], [JNPB01]. The concept of gradation is implemented for the noun,
adjective, verb and adverb categories. Compounding of verbs is a typical
Indian language phenomenon. This, along with the structure of verbs and
their categorization, has been incorporated in the Indo WordNet.

The English WordNet describes seven types of part-whole relationship
and has implemented three of them, namely component-object,
member-collection and stuff-object. Euro WordNet [Vos99] has limited the part-whole
relations to five types, namely component-object, a portion and the whole from
which it has been detached, a place and a wider place which includes it, a
set and its members, and a thing and the substance it is made of. The
Indo WordNet implements the part-whole relationship in eight different ways as
shown in Table 2.

We have implemented three types of antonymy relations, viz., gradable
antonymy {hot - cold}, complementary antonymy {alive - dead},
and converse antonymy {give - take}.


Table 1: Antonyms can be formed on the basis of the following criteria

Criterion      Examples (English glosses)
Size           big - small, thick - thin
Quality        good - bad, love - hatred
Gender         son - daughter, father - mother
State          beginning - end
Personality    Rama - Ravana, David - Goliath
Direction      east - west, front - behind
Action         give - take, buy - sell
Amount         little - much, light - heavy
Place          far - near
Time           day - night, morning - evening































Table 2: Meronymy Relationship

Relation             Example (English gloss)
Component-object     head - body
Member-Collection    tree - jungle
Feature-Activity     lecture - ceremony
Place-Area           Delhi - India
Phase-State          youth - age
Portion-mass         lump - clay
Resource-process     pen - writing
Position-Area        doctor - medical treatment


2.1 Gradation 

Gradation is also a type of semantic relation. The Indo WordNet deals
elaborately with the interpretation of gradation in different domains, as
shown in Table 3. Gradation is possible only when there are two opposite
terms. For example, in the diagram shown in Fig 1, the two terms A
and B, which are in contrast with each other, open a possibility for other
terms which may fall between them. We can call A and B positive and
negative respectively; between them there is a neutral term, which can be
called the Intermediate-term. The Intermediate-term is the core of gradation. In the
example shown in Fig 1, {damp} is the Intermediate-term, which is
a blend of {dry} and {wet}. A term which occurs between (+1) and (0)
is called the Pre-Intermediate term and is denoted by (+1/2), for example
{arid}. On the other hand, a term between (0) and (-1) is called the
Post-Intermediate term, like {moist}. Here it is noteworthy that the Pre- and
Post-Intermediate terms can also be antonymous to each other, as in
{arid - moist}, but the Intermediate-term cannot have an antonym.

Figure 1: Gradation levels

A(+1)    (+1/2)              (0)             (-1/2)               B(-1)
         Pre-Intermediate    Intermediate    Post-Intermediate
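
The gradation levels can be expressed compactly in code. The following Python sketch is an illustration only: the dictionary `SCALE` and the function `graded_antonym` are hypothetical names, and the English glosses of the dry/wet example stand in for the Hindi terms. It encodes the rule that terms at opposite levels are antonymous while the Intermediate-term at level 0 has no antonym.

```python
from fractions import Fraction

# Levels from Figure 1, keyed by the English glosses of the example terms.
SCALE = {
    "dry": Fraction(1),         # A (+1)
    "arid": Fraction(1, 2),     # Pre-Intermediate (+1/2)
    "damp": Fraction(0),        # Intermediate (0)
    "moist": Fraction(-1, 2),   # Post-Intermediate (-1/2)
    "wet": Fraction(-1),        # B (-1)
}


def graded_antonym(term, scale=SCALE):
    """Return the term at the opposite level, or None for the level-0 term."""
    level = scale[term]
    if level == 0:
        return None  # the Intermediate-term has no antonym
    for other, other_level in scale.items():
        if other_level == -level:
            return other
    return None
```

With this scale, `graded_antonym("arid")` yields "moist" and `graded_antonym("damp")` yields `None`.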


































Table 3: Gradation Relationship

Domain         Graded terms (English glosses)
State          childhood, youth, old-age
Size           large, medium, small
Light          bright, foggy, dark
Gender         male, eunuch, female
Temperature    hot, lukewarm, cold
Colour         fair, wheatish, dark
Time           day, dusk, night
Quality        good, ordinary, bad
Action         sleep, drowsy, awake
Manner         speedily, medium, slowly


2.2 Verbs in the Indo WordNet

Hindi verbs are varied in nature. The Indo WordNet tries to give much
semantic and syntactic information about Hindi verbs. An effort has
been made to examine the entire verb lexicon by classifying the verbs according
to their structure and semantic domain [Sin85] [Bah97].

2.2.1 Simple Verb

Simple verbs are verbs made of one root verb, for example {to eat} and
{to sleep}.

2.2.2 Compound Verb

These verbs are made up from either another verb or another part of speech.
Accordingly they can be classified as

• Verbs made with Nouns: for example, {to begin}, {to bear a blow}.

• Verbs made with Adjectives: for example, {to taste sweet}, {to be born}.

• Verbs made with Adverbs: for example, {to go away}, {to go forward}.

• Verbs made with Verbs: for example, {to have eaten}, {to go and sit}.






























2.2.3 Combination Verb

These verbs are made up of two root verbs which are related to each
other from the point of view of meaning. For example, {to read and write},
{to eat and drink}.

2.2.4 Onomatopoeic Verb

These verbs are made from onomatopoeic words. For example, {to knock}
and {to hum} are each derived from the word for the corresponding sound.

2.2.5 Conjunct Verb

These are made up of two root verbs where the sense of {action} is
hidden. For example, {to take away}, which means {to take away and go}.

2.3 Causative Verbs

A verb that denotes causing something to happen is a causative verb.
According to the structure, causative verbs are of two types: the
first causative verb and the second causative verb [Sin85]. For example,
{to make somebody sleep} is a first causative and {to make somebody sleep
through the effort of a third person} is a second causative, as in
{the mother is putting the baby to sleep} and
{the mother is having Sita put the baby to sleep}.

This theory of verbs has been implemented in the Indo WordNet by
introducing three new semantic relations called compound, causative, and
conjunct. Thus, for example, there is a bidirectional compound relation between
{to read} and {to write}, a unidirectional causative relation
from {to make somebody sleep} to {to make somebody
sleep through the effort of a third person} and a bidirectional conjunct relation
between {far} and {to move}.
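
The difference between bidirectional and unidirectional relations can be made concrete with a short Python sketch. It is an assumption-laden illustration rather than the Indo WordNet code: the function `add_relation`, the dictionary-based store and the use of English glosses as keys are all hypothetical.

```python
# Directionality of the three verb relations, as described in the text.
BIDIRECTIONAL = {"compound", "conjunct"}
UNIDIRECTIONAL = {"causative"}


def add_relation(store, relation, source, target):
    """Record a link; bidirectional relations are also stored in reverse."""
    if relation not in BIDIRECTIONAL | UNIDIRECTIONAL:
        raise ValueError(f"unknown relation: {relation}")
    store.setdefault((source, relation), set()).add(target)
    if relation in BIDIRECTIONAL:
        store.setdefault((target, relation), set()).add(source)


store = {}
first = "to make somebody sleep"
second = "to make somebody sleep through the effort of a third person"
add_relation(store, "compound", "to read", "to write")
add_relation(store, "causative", first, second)
```

After these calls the compound link can be followed from either verb, whereas the causative link can only be followed from the first causative to the second.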

3 Database 

The design of the Indo WordNet is based on the fact that the basic relations
or lexical links are between synonym sets. The lexical database is stored in
a MySQL database [MyS00]. The synsets and the semantic links between
them are grouped according to different grammatical categories like noun,
adjective, verb, adverb and are stored in separate tables. For all categories
the word and its polysemy are stored in the table tbl_synset_data. Table 4
gives the description of the database tables. Each column heading gives the
table-name in the database and the values below it give the field-names
in that table.

Table 4: Tables in the Database

tbl_synset_data  tbl_noun_data  tbl_verb_data  tbl_adj_data  tbl_adv_data
synset_id        synset_id      synset_id      synset_id     synset_id
head_word        hypernyms      hypernyms      antonyms      antonyms
synset           hyponym        troponyms      similar       derived_from
gloss            meronyms       entailments    attributes    gradation
onto_node_id     antonyms       gradation      gradation
category         gradation      causative
                                compound
                                conjunct
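
To make the table layout concrete, the following Python sketch creates two of the tables and runs one lookup. It is a sketch under stated assumptions: the paper stores the data in MySQL, whereas this example uses Python's built-in sqlite3 module so that it is self-contained; the column types, the storage of related synset ids as comma-separated text, the sample rows with their glosses and the helper `meronyms_of` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_synset_data (
    synset_id    INTEGER PRIMARY KEY,
    head_word    TEXT,
    synset       TEXT,
    gloss        TEXT,
    onto_node_id INTEGER,
    category     TEXT
);
CREATE TABLE tbl_noun_data (
    synset_id INTEGER PRIMARY KEY REFERENCES tbl_synset_data(synset_id),
    hypernyms TEXT,   -- assumed: comma-separated synset ids
    hyponym   TEXT,
    meronyms  TEXT,
    antonyms  TEXT,
    gradation TEXT
);
""")
# Toy rows with made-up glosses, using the head - body example of Table 2.
conn.execute("INSERT INTO tbl_synset_data VALUES "
             "(1, 'body', 'body', 'toy gloss for body', NULL, 'noun')")
conn.execute("INSERT INTO tbl_synset_data VALUES "
             "(2, 'head', 'head', 'toy gloss for head', NULL, 'noun')")
conn.execute("INSERT INTO tbl_noun_data (synset_id, meronyms) VALUES (1, '2')")


def meronyms_of(conn, head_word):
    """Return the head words of the synsets listed as meronyms of head_word."""
    row = conn.execute(
        "SELECT n.meronyms FROM tbl_synset_data s "
        "JOIN tbl_noun_data n ON n.synset_id = s.synset_id "
        "WHERE s.head_word = ?", (head_word,)).fetchone()
    if row is None or not row[0]:
        return []
    ids = [int(part) for part in row[0].split(",")]
    placeholders = ",".join("?" * len(ids))
    found = conn.execute(
        "SELECT head_word FROM tbl_synset_data "
        f"WHERE synset_id IN ({placeholders})", ids)
    return [r[0] for r in found]
```

Keeping the word-level data in tbl_synset_data and the relations in one table per category means a query first resolves the word to a synset id and only then follows the category-specific links.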

4 Conclusions, Observations and Results 

We have described in this paper our experiences with the building of a
WordNet for Hindi which we call the Indo WordNet. After thoroughly
examining the unique features of Hindi as a language, we have incorporated
these features in the Indo WordNet. All the basic semantic relations like
synonymy, hypernymy, hyponymy, meronymy, holonymy, troponymy, entailment
etc. have been included. Additionally, the Indo WordNet gives finer shades of
antonymy and meronymy relations. Compounding is a typical phenomenon
of the Indian languages. Different types of nominal and verbal compounds
are found in most of the Indian languages. The Indo WordNet implements
compounding which is useful for the automatic analysis of Hindi.

The system also attempts to integrate Ontology by linking synsets to
the ontological nodes. The integration of lexicon, WordNet and Ontology
is expected to create an extremely rich and vast lexical resource which will
support the extensive work on automatic analysis and generation of Hindi
going on at IIT Bombay. This is consistent with the current thinking on
solving difficult problems like word sense disambiguation in natural language
understanding, machine translation and information retrieval systems.

The data entry graphical user interface has been designed using Java/JFC.
It provides a user-friendly method to insert the synsets and the semantic
relations into the database. We have implemented an organizer utility to
normalize and verify the consistency of the database. We have used ISCII font 1 for
data storage and DV-TT-Yogesh font 2 for the display of data in Devanagari
text. The web interface to the Indo WordNet has been implemented. The
site address is http://www.cse.iitb.ac.in/meru/wn/webhwn/. The web
interface uses the PHP 4.0 scripting language to process the search query. At present
the Indo WordNet contains over 7000 synsets. The distribution according to
the POS is 60% noun, 25% adjective, 10% verb and 5% adverb synsets.

1 http://www.tdil.gov.in/standards.htm

When completely implemented, the Indo WordNet will be a milestone in
the computational processing of Indian languages. The basic design and
implementation strategy can be applied to almost all Indian languages because [...]


Acknowledgement 

This research is supported by a grant from the Ministry of Information
Technology, India, New Delhi.


Authors 

Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey and
Pushpak Bhattacharyya 3 , Computer Science and Engineering Department, Indian
Institute of Technology - Bombay, India.


References 

[Bah97] Hardev Bahri. Vyavaharik Hindi Vyakaran Tatha Rachna.
Lokbharti Prakashan, Allahabad, India, 1997.

[Fel98] Christiane Fellbaum, editor. WordNet: An Electronic Lexical
Database. The MIT Press, 1998.

[JNPB01] Sanjay Kumar Jha, Dipak K. Narayan, Prabhakar Pandey, and
Pushpak Bhattacharya. A Wordnet for Hindi. In Workshop on
Lexical Resources in Natural Language Processing. Hyderabad,
India, January 2001.

[KK97] Arvind Kumar and Kusum Kumar. Samantar Kosh Vol. 1 and 2.
National Book Trust, India, 1997.

[MyS00] MySQL. Database for Linux, http://www.mysql.com/, 2000.

2 http://www.mit.gov.in/hindi/loadfonts.htm
3 pb@cse.iitb.ac.in





[Sin85] Suraj Bhan Singh. Hindi ka Vakyatmaka Vyakarana. Sahitya
Sahakar, Delhi, India, 1985.

[Var98] Ramchandra Varma. Pramanika Hindi Kosh. Lokabharati
Prakashan, Allahabad, 1998.

[Vos99] Piek Vossen, editor. EuroWordNet General Document. University
of Amsterdam, http://www.hum.uva.nl/~ewn, 1999.





Tamil WordNet 


Devi Poongulhali P 
Kavitha Noel N 
Preeda Lakshmi R 
A. Manavazhahan 
T. V Geetha 


Abstract 

This paper on ‘Tamil Wordnet’ presents the design and implementation issues involved in creating a
lexical database for the Tamil language. The infrastructure of the Tamil Wordnet differs from its standard
prototypes, to accommodate the unique features and specialties that are characteristic of the Tamil
language. The linguistic aspects of Tamil dictate the design of the Wordnet. Implementation
details such as the design of the lexicographic files, database tables and grinder utility are
discussed in the course of the paper. An application to demonstrate the use of Tamil Wordnet is also
considered.

Introduction 

WordNet is an online lexical reference system. Word forms in WordNet are represented in their
familiar orthography; word meanings are represented by synonym sets (synsets), lists of synonymous
word forms that are interchangeable in some context. Two kinds of relations are recognized: lexical
and semantic. Lexical relations hold between word forms; semantic relations hold between word
meanings [1].

The source files of WordNet are written by lexicographers. These files are the product of a detailed
relational analysis of lexical semantics (a variety of lexical and semantic relations are used to
represent the organization of lexical knowledge) of the particular language. Each synset consists of a
list of synonymous words or collocations and pointers that describe the relations between this synset
and other synsets. These relations include (but are not limited to) hypernymy / hyponymy, antonymy,
entailment, and meronymy / holonymy. A word or collocation may appear in more than one synset,
and in more than one part of speech. Each use of a word in a synset represents a sense of that word in
the part of speech corresponding to the synset [2].

Wordnet in Tamil 

The design and implementation of Wordnet is specific to the uniqueness of the language for which it 
is developed. 

Among the many languages prevalent in the world, Tamil is one of the very few whose
roots date back more than two thousand years but which is still a popularly used vernacular, not only in
India but also in many other nations like Malaysia, Singapore and Sri Lanka. The grammatical
complexity and structure of Tamil can be compared to that of ancient languages like Greek and Latin.

Developing a complete lexical database for such a vast language would require a great amount of
research. The language-specific peculiarities that need to be handled are font issues, cultural issues,
morphological issues, and free word order and verb structures.

Most other Wordnets that have been developed already are based on the UNIX platform. This would
involve using ISCII format to represent the characters specific to the language or using special drivers
for handling the fonts. In order to avoid these complexities we have opted to work on the WINDOWS
platform using the ASCII format (Font: TAB-ANNA). Tamil Wordnet is implemented in JAVA, as it
allows the use of Unicode font format.



The Tamil lexicographers have chosen to use many categories to represent the words in Tamil. This
enables us to handle the complexity of the language in a more efficient manner. For example, in
order to tackle the large number of tradition- and culture-specific words in Tamil, a new category
based on tradition has been introduced. This category includes words that depict temple
functions, pooja items, festivals, etc. This categorization may be applicable to other Indian languages
also.

A set of morphological functions is applied in order to extract the root form of the search word, and
further processing is based only on this root form. Unlike English, Tamil is a morphologically rich
language and hence extraction of the root form from the derived words is complex. Nouns combine with
plural markers and case markers to form the derived form. Some nouns can combine with case
markers only after conversion to oblique form. In Tamil, verbs occur in a composite form. In addition
to Person, Gender, and Number markers that require agreement with the subject, verbs combine with
aspect and modal auxiliaries. Special markers indicate adjectives and adverbs in Tamil. Hence special
functions are required to extract the root word. A few other Sandhi rules are also to be tackled in order
to extract the root form. The part of speech determiner has to handle all the aspects mentioned
above.


Free Word Order and Verb Structure 

In English and English-like languages a fixed format exists for sentence structure (SVO: Subject +
Verb + Object). Tamil follows a free word order, and especially in written form the only constraint is
that verbs normally occur at the end, while all other phrasal structures can occur in any position. This
permits greater flexibility in the language. In free word order the positions of the noun and verb are
demarcated by the use of case markers for nouns and PNG markers for verbs. However, in most cases
Tamil follows an SOV pattern and we indicate verb frames in terms of the SOV structure as an
underlying pattern. In the context of Tamil it is understood that a change of position of all phrasal
structures except the verb is allowed.

Tamil WordNet design and implementation 

The main components of Wordnet include the Lexicographic files, the Grinder, the Database files and the User
Interface. The structure and design of these components depend upon the implementation details.
The standard implementation uses file systems to hold the lexicographic and database files, and the grinder
utility converts the lexicographic files into database files. The interface searches these database files
to output the required data. In our implementation the lexicographic and the database files are
represented in the form of relational database tables [2]. We have also added an input interface for the
lexicographers.


Lexicographic Files 

The lexical hierarchy of the words in Wordnet is stored in the lexicographic files. The categories and
hierarchy specified in these files are language specific, and are developed by lexicographers.

The Parts of Speech (POS) in the Tamil language include ‘participle’ (Idaicchol) in addition to noun,
verb, adjective and adverb. The words belonging to different Parts of Speech are classified into
separate lexical files. The classification in each POS is as follows.


NOUN 

Due to the plenitude of nouns present in the Tamil language, lexicographers have chosen to
categorize nouns based on 12 unique beginners. Each unique beginner has its own hierarchy of words.
The words in each hierarchy are in turn organized into separate files. The unique beginners and the
hierarchy under them are given below.





1. PORUL - Entity(3)
• Uyarthinai - Rationals
o Manithar - Human
o Theivam - God
o Naragar - Demon
• Agrinai - Irrationals
o Uyirullavai - Living things
■ Vilanga - Animal
■ Neervaivana - Aquatics
■ Oorvana - Reptiles
■ Poochigal - Insects
■ Paravaigal - Birds
■ Thavaram - Plants
o Uyir artravai - Non living things
■ Eiyarkai - Nature
• Eiyarkai_porul - Natural things
• Nigalvu - Occurrence
■ Seyarkai - Artificial
• Kalam - Time
• Idam - Place
• Alavugal - Measurements
o Ennuppeyar - Countable
o Alavaippeyar - Measurable
• Urpaththi - Product
o Thidam - Solid
o Thiravam - Liquid
o Vayu - Air

2. PANPAADU - Tradition
• Nigalchi - Event
• Panpaattupporul - Traditional things
• Seyal - Action (Verbal nouns)

3. THURAI - Occupation

4. PANBU - Property
• Vadivam - Shape
• Vannam - Color
• Thanmai - Character

5. URAVU - Relation

6. THODARBU - Communication

7. UNAVU - Food
• Thidam - Solid
• Thiravam - Liquid

8. SEYAL - Action (Verbal noun)

9. URUPPU - Part

10. KULU - Group

11. UNARVU - Feeling

12. NILAI - State
• Nadappu nilai - Current Status
• Thaguthi - Status

Verb 


The Tamil lexicographers have chosen to divide verbs into two basic categories namely ‘Action’ 
verbs and ‘State’ verbs [4]. The lexical files for verbs are based on these categories. The classification 
is as follows. 


Verb
    Action verb (Seyal)
    State verb (Nilai)

I. Action Verb
1. Movement verbs (asaivu/eykkam)
2. Transfer verbs (edam mattru venai)
3. Impact verbs (oru seyalin velaivu)
4. Communication verbs (thodarpu)
5. Association verbs (kalaththal)
6. Dynamic verbs (vegam sartha)

II. State Verb
1. Psychological verbs (mana unarvu) (mental state verbs)
2. Emotion verbs (oonarchi)
3. Perception verbs (kandunarthal)
4. Existential verbs (velippattu vinai)
5. Process verbs (nadappu seyal)

ADJECTIVES AND ADVERBS 

In Tamil, adjectives and adverbs are mostly identified using special markers. In most cases
the root word (noun/verb) can be extracted from these words using morphological rules. These
adjectives/adverbs cannot be stored in the lexical files due to their huge number (e.g. Alagana -
beautiful, Vegamaha - hurriedly). Hence a search on these words in Tamil Wordnet uses the
base form to answer the query. Those words which cannot be subjected to morphological rules are
represented in separate files, one each for adjective and adverb (e.g. sila - few, pala - many).

Participle 

A separate category of words used to link the noun and verb forms is given as participle (Idaicchol)
in Tamil. The different types of participles used in Tamil are case markers, verbal participles,
interrogative letters, demonstrative letters, the participle of ‘um’, etc. As these words cannot be
represented with either noun or verb, they stand as a category by themselves (e.g. Udan, Pondra).

Design of Tables 

The words of the language are classified according to Parts of speech (nouns, verbs, adjectives and
adverbs) to incorporate the differences in the word forms. The design of each lexicographic table
varies to account for the semantic differences of the word form. Further, each classification is
subdivided into categories depending on the 'unique Tops'. Separate tables are used to represent each
of these sub-categories.

The words are arranged in the form of synsets (lists of synonymous word forms that are
interchangeable in some context). In the tables each synset has a unique synset id which is used as the
primary key. To account for the different senses of the same word (e.g. 'padi' - steps, 'padi' - measuring
instrument) in the same part of speech, an incremental sense_number is given to each of its
occurrences in the same lexical table. Each entry of the synset is thus a unique sense in that part of
speech.
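
The numbering scheme can be illustrated with a small Python sketch. It is an assumption-based illustration, not the Tamil Wordnet code: the class `LexicalTable`, the row layout and the glosses passed in are hypothetical, and a plain list stands in for the relational table.

```python
from collections import defaultdict


class LexicalTable:
    """Toy stand-in for one lexical table of a single part of speech."""

    def __init__(self):
        self.rows = []                        # one row per synset
        self._sense_count = defaultdict(int)  # word -> senses assigned so far

    def add_synset(self, words, gloss):
        synset_id = len(self.rows) + 1        # unique id, used as primary key
        senses = {}
        for word in words:
            # Each further occurrence of a word gets the next sense number.
            self._sense_count[word] += 1
            senses[word] = self._sense_count[word]
        self.rows.append(
            {"synset_id": synset_id, "senses": senses, "gloss": gloss})
        return synset_id


nouns = LexicalTable()
first_id = nouns.add_synset(["padi"], "steps")
second_id = nouns.add_synset(["padi"], "measuring instrument")
```

The pair of word and sense number therefore identifies one sense within a part of speech, while the synset id identifies the row.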

The relationships among the synsets within and across categories (the different levels of the
Wordnet tree) are portrayed using pointers (links between tables/records). These links (relations) are
represented in the form of field entries containing sets of related words. We have portrayed the entire
Wordnet database in the form of upward (hypernym, meronym, etc.) and horizontal (antonym, attributes,
etc.) links only. The other relations, namely holonym, hyponym, etc., are derived from these already
existing links. This is done basically to minimize the size of the database in use and at the same time
increase the efficiency of the search function of the grinder utility.

For example, if 'pennchingam' - 'lioness' is the word, its parent (hypernym) in the Wordnet tree is
'chingam' - 'lion'. So in the table entry for 'lioness', 'lion' is specified but not vice versa. A gloss for
each synset is also specified in the table.

In addition, the verb tables have verb structures, that is, sentence structures, specified. The relations
specified for verbs are semantically based. The number of adjectives and adverbs used in Tamil is limited,
and they can be classified based on the nature of usage.

Design of Grinder 

The Grinder uses database connectivity to access the relational database tables (JDBC has been used 
with Microsoft Access). Using the lexical tables, the grinder utility generates 4 database tables (one 
for each of the parts of speech). 

Construction of the database tables from the lexical table involves following the links provided in the
lexical table. As seen before, only the horizontal and upward links are stored in the lexical table.
Hence the grinder can generate the horizontal and upward relations directly from the links in the
lexical table. The synsets of the other relations (e.g. the downward relations) have to be obtained by
backtracking these links. The resulting database tables have all the relations specified directly for
each of the synsets.

For example, 'pennchingam' - 'lioness' will have the synset id of 'chingam' - 'lion' (with a particular
sense number) in the 'hypernym' field, and 'lion' (with the same sense number) will in turn have the
synset_id of 'lioness' specified in the 'hyponym' field.
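
The backtracking performed by the grinder can be sketched as follows. This is only an
illustration under stated assumptions: the lexical table is modelled as a Python dictionary
rather than a relational table, and the function name grind, the INVERSE mapping and the
identifiers such as "chingam#1" are hypothetical.

```python
# Assumed pairing of each stored upward relation with the downward relation derived from it.
INVERSE = {"hypernym": "hyponym", "meronym": "holonym"}


def grind(lexical_table):
    """Return a table in which the downward relations are filled in.

    lexical_table maps synset_id -> {relation_name: [target synset_ids]} and
    contains only upward and horizontal links.
    """
    # Start from a copy of the stored (upward and horizontal) links.
    database = {
        sid: {rel: list(targets) for rel, targets in rels.items()}
        for sid, rels in lexical_table.items()
    }
    # Backtrack every upward link to obtain the corresponding downward link.
    for sid, rels in lexical_table.items():
        for rel, targets in rels.items():
            inverse = INVERSE.get(rel)
            if inverse is None:
                continue  # horizontal links (e.g. antonym) are kept as stored
            for target in targets:
                database.setdefault(target, {}).setdefault(inverse, []).append(sid)
    return database


lexical = {
    "chingam#1": {},                                # lion
    "pennchingam#1": {"hypernym": ["chingam#1"]},   # lioness
}
print(grind(lexical))
# {'chingam#1': {'hyponym': ['pennchingam#1']}, 'pennchingam#1': {'hypernym': ['chingam#1']}}
```

Storing only one direction of each relation keeps the lexical table small, at the cost of one
pass over all links whenever the grinder is rerun.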

These database tables are then input to the user interface program. The interface looks up the
word (the root form of the search word) in all 4 tables. It gets the synset_ids for the different relations
from the tables and then outputs the different senses and the other relations (hypernym, holonym,
hyponym, meronym, antonym, etc.). The gloss and verb structure are also displayed.


Interface for Tamil Lexicographers 


In order to enable additions to the lexicographic tables and to facilitate easy data entry, an interface
is provided for the lexicographers. This helps in adding new synsets to the lexicographic tables. Here
also only the upward relations (hypernym, meronym, etc.) need to be specified. The interface will
generate unique synset ids to identify these newly added synsets. If a word has more than one sense in the
same part of speech then a unique sense number for that particular synset is generated. The
interface program then adds this record to the end of the corresponding table (which has to be specified)
when the input is verified by the data-entry personnel.

After the database is updated the grinder utility is rerun to reflect the changes in the database tables. 
The user interface can now query for these added synsets also. This enables the expansion of the 
vocabulary of the lexicographic tables incrementally. 

Applications 

Lexicons are indispensable resources for almost every natural language project. Systems performing 
word sense disambiguation, information extraction or retrieval, prepositional attachment, 
interpretation of nominalizations, textual summarization, co-reference resolution, abductive 
reasoning, conversational implicature, recognition of textual cohesion and coherence, intelligent 
Internet searches and some of the digital libraries projects use WordNet. It can also be used in 
designing crossword generators. 

These applications query the Wordnet to obtain different relationships among words. This information 
is interleaved and is used according to the application in question. 

Wordnet provides information about the usage of words in different senses. This feature of Wordnet
can be used in crossword generators. After the grid is filled up with appropriate words, Wordnet can
be used to provide the clues. This application would require extensive use of the glosses and verb structures
incorporated in Wordnet. The relations among words and their glosses can also be used in clue
formation. Since Wordnet provides an extensive database, the user will be provided with a variety of
choices from which to choose the most appropriate clue. The generation of crosswords in Tamil is a challenging
task because a consonant and a vowel combine to form a single letter that has to be considered in the
crossword.

Conclusions 

We have developed a functional Wordnet model which is capable of handling the different features
required by the Tamil language. However, the content of the database has been covered only for a
selected application-dependent domain. The interface provided for the lexicographers is to ensure
content growth in future to cover the entire Tamil language.

An additional feature included in Tamil Wordnet is the introduction of new categories in the Wordnet
tree in order to cover the aspects that are specific to the Tamil language and culture. With this Tamil
Wordnet as a backbone we are planning to develop a crossword generator for Tamil.

Current Size & Coverage Aspects of Tamil Wordnet 

We have taken Sports as our domain for the current development. This domain covers approximately
the following numbers of words.

300 nouns
200 verbs
100 adjectives
160 adverbs
50 participles

810 total number of words


References 

Christiane Fellbaum. (1999) WordNet - An Electronic Lexical Database, MIT Press, second edition.

George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. (1990)
Introduction to WordNet: An On-line Lexical Database, International Journal of Lexicography, 3(4),
pp 235-312.

Tolkappiar. (1995) Tolkappiam-Porulatikaram, General Editor - Dr. A. Chidambaranatha Chettiar,
second edition.

Vasu R. (1998) Case System in Modern Tamil.


Preeda Lakshmi. R, School of Computer Science and Engineering, College of Engineering, Guindy, Anna
University, Chennai-600 025, preedalakshmi@hotmail.com

Devi Poonguzhali. P, School of Computer Science and Engineering, College of Engineering, Guindy, Anna
University, Chennai-600 025, devipoonguzhali@hotmail.com

Kavitha Noel. N, School of Computer Science and Engineering, College of Engineering, Guindy, Anna
University, Chennai-600 025, kavithanoel@hotmail.com

A. Manavazhahan, Project Associate, School of Computer Science and Engineering, College of Engineering, 
Guindy, Anna University, Chennai-600 025. 

T.V.Geetha, Professor, Project Associate, School of Computer Science and Engineering, College of Engineering, 
Guindy, Anna University, Chennai-600 025. 




Word Sense Disambiguation Using Semantic Graph 

Narayanan Unny E 
Pushpak Bhattacharyya 


Abstract 


This work describes a method of word sense disambiguation by finding similar words in a 
text. We have used some characteristic properties of the text and its constituent words for the 
disambiguation task. Using the WordNet, the algorithm constructs a semantic structure on 
the text illustrating the relations among the words of the text. This structure is then used for 
disambiguating the constituent words. 


1 Introduction 

The problem of word sense disambiguation has been a classical problem in the field of 
natural language processing. The solution to this problem can in turn solve problems in 
machine translation, information retrieval, automatic hyperlinking and other such 
applications involving natural language processing. 

The popular approaches to word sense disambiguation can be categorized as (a) 
supervised (Gale et al., 1992), (Ng and Lee, 1996) (b) unsupervised (Yarowsky, 1995), 
(Resnik, 1997) and (c) machine readable dictionary based (Cowie et al., 1992), (Miller et al., 
1994), (Agirre and Rigau, 1995). In the first case, a correspondence between the sense of a 
word and its context of usage is established using a training corpus consisting of labeled 
senses of a word along with its usage context. Though the supervised approach achieves a 
high rate of accuracy, it is limited to only a small subset of the vocabulary. This subset is 
decided by the words present in the documents of the training corpus. On the other hand the 
unsupervised approach has a wider coverage but at a lower accuracy. 

Our work falls under the third category where the WordNet (George A. Miller, 1990) is 
used. Due to the well-structured organization of WordNet, the disambiguation algorithms 
using it have been able to attain high degrees of accuracy. 

The novelty in our work is the use of a structure similar to a lexical chain (Morris J. and Hirst G.,
1991). Though many algorithms that perform hyperlinking (Green S, 1999) using lexical
chains claim implicit word sense disambiguation, there is perhaps no algorithm that uses
properties of lexical chains in explicit sense disambiguation of words. Our algorithm
disambiguates words by building a directed acyclic graph (DAG) structure over the
words in the text and using the properties of this structure to disambiguate the words in the
text.

2 WordNet as a hypergraph 

WordNet can be defined in simple terms as a network of concepts. Sets of synonyms form the
building blocks. Such a set of synonyms is termed a synset. These synsets are linked to each
other by a group of semantic relations called hypernymy, hyponymy, meronymy, holonymy
and antonymy. Another view of the WordNet could be that of a hypergraph with synsets
forming the nodes and the relations forming the arcs. In this structure, we find that nodes
which lie close to each other are more related than nodes that lie far apart. The shortest
distance between two synsets in terms of the link distance separating them is known as the
conceptual distance (Rada et al., 1989). The smaller the conceptual distance between two synsets,
the more similar they are. As a result, we can conclude that the similarity of two synsets in the
WordNet graph is inversely proportional to the link distance separating them. This
observation forms the basis of our algorithm.
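
The notion of link distance can be made concrete with a small sketch. It assumes a toy graph
stored as a Python adjacency dictionary in place of the real WordNet database; the function name
link_distance and the node names in TOY are illustrative only.

```python
from collections import deque


def link_distance(graph, source, target):
    """Number of links on a shortest path between two synsets, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Toy graph: every relation link is listed in both directions.
TOY = {
    "car": ["vehicle", "wheel"],
    "vehicle": ["car", "truck"],
    "truck": ["vehicle"],
    "wheel": ["car"],
    "idea": [],
}

print(link_distance(TOY, "car", "wheel"))    # 1
print(link_distance(TOY, "wheel", "truck"))  # 3
print(link_distance(TOY, "car", "idea"))     # None
```

Under the observation above, wheel is judged more similar to car (distance 1) than to truck
(distance 3), and unrelated to idea.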

3 Text Structure 

In this section, we look at some unique properties of a text that can be exploited to aid the 
word sense disambiguation process. 

A typical text consists of a flow of ideas that are expressed in the form of structured 
sentences. These ideas are expressed by a series of words, the words being placed in such a 
way that the flow of the idea is clear. Depending on the ideas expressed in a text, one can 
classify the text as belonging to a certain topic. A human reader performs this task of 
classifying a text by judging the priorities given by the author to each of the ideas expressed 
in the text. 

The words conveying the main ideas in a text are commonly known as the keywords of the
text. Although nouns and verbs both play important roles in expressing the ideas, we feel that the
nouns carry the main burden of expression. We hypothesize that, given the topic of a text,
there is a high probability that most of the words in the text are very similar to the topic. For
instance, in a text about cars we expect to find a lot of words related to car, like steering,
wheels etc. This condition of similarity implies that most of the synsets corresponding to the
words in a text would lie close to each other when mapped on to the WordNet graph. We can
then impose this condition as a constraint of proximity in the WordNet that a particular word
from the text must satisfy. This type of constraint, imposed on the senses of a group of
contiguous words to disambiguate them, has been used in many algorithms. The first
algorithm that used such a condition was that of Michael Sussna (1993). He
considers a group of contiguous words in a window. The sense combination for the words in
the window is chosen such that the sum of the distances between the synsets of the chosen
senses is minimized.

Another algorithm that considers this proximity of synsets is the one by Eneko Agirre and
German Rigau (1995). In this, they define a measure of conceptual density as the number of
words in the context window that lie in the sub hierarchy of synsets of the candidate word.
The synset of a word is chosen such that the conceptual density is maximized.

The main drawback of these algorithms is that they assume that words lying close to each 
other in the text have similar senses. This might not be necessarily true. It has also been 
observed that the reference of a word can extend up to a large distance in the text. Use of 
anaphors is a good example of such a case. Hence, though we can be sure that the text is 
populated with similar words pertaining to the main topic of the text, we cannot make any 
such assumption about the contiguous words in the text. The proof of such a conclusion lies 
in the result obtained by Eneko Agirre and German Rigau. In the evaluation of their 
algorithm it was seen that the precision of disambiguation increased with the increase in the 
window size used for disambiguation. This indicates that the constraint of proximity in the 
WordNet graph works better as the context size increases. 

Another flaw in the above description of a text structure is that a text can talk about more 
than one topic. We must take this factor into consideration when designing WSD algorithms. 
The idea of capturing the flow of ideas in the text has been previously used in lexical chains. 
The aim there was automatic hyperlinking (Green S., 1999). In our algorithm we use the idea 
of lexical chains to obtain more effective disambiguation. 

The discussion so far can be summarized by an illustration of how the words in a text can 
represent the ideas in a text when they are mapped onto the synsets of the WordNet graph. 


Figure 1 WordNet Graph



In figure 1, the large circle represents the WordNet graph. The smaller circles with a 
number of points inside them correspond to the clusters formed by putting together similar 
words in the text. The points that we find in them are the synsets corresponding to the words 
in the text. Each cluster stands for an idea that has been expressed in the text. 

When a word is not disambiguated it has more than one image in the WordNet
graph. If our previous assumption of clustering is true then we can choose the image of the
word in such a way that we get a closely packed cluster. The question now is where to
start clustering a text. This is where the property of monosems helps us.
Monosemous words have only a single sense and therefore serve as the right starting point
for clustering.

Our algorithm uses the principles discussed above to produce a cluster of words in the 
text, and in the process disambiguate them. 


4 Semantic Graph 

The algorithm works by building a Directed Acyclic Graph (DAG) like semantic structure
from the words in the text. We call this structure the Semantic Graph of the text. In this
semantic graph, the nodes correspond to the different senses of the words in a text. The arcs
between them represent the dependency between the nodes. The arcs are directed, and the
direction denotes the dependency: the child is dependent on the parent. Furthermore each node
has a score associated with it. The magnitude of this score determines the sense of the word.
The score at a node corresponds to the probability of a word taking that particular sense.

5 The Algorithm 

The algorithms that use lexical chains to capture the flow of ideas usually start with the first
word of the text. With this word as the starting point, each of the other words in the text is added to the
lexical chain that is most similar to it. The similarity is measured in terms of
link distance in the WordNet graph. The problem with such an approach is that we do not
know a priori the sense of the first word. At this point monosems are used. Monosems have
only a single sense and hence do not need any disambiguation. Hence we choose the
monosems in the text as the starting points for the chain building process.


The individual steps of the algorithm, which integrates all the concepts discussed above, are
given below.

1. Collect the monosems: In this initial step we collect all the monosems in the given text. 
These form the roots of the semantic DAG that gets built as the algorithm proceeds. For a 
single word in the text we have all the senses of the word as nodes in the DAG. The number 
of roots of the DAG becomes equal to the number of monosems in the text. 

2. Initialize the scores of monosems: Each node in the DAG has a score
corresponding to the probability of the word taking that sense. To start with, the roots of the
DAG are monosems. These are initialized with a score of 1.

3. Find the link distance to other words: By the definition of a node in the DAG we can
recognize that each node is the synset corresponding to a word in the text. We take a word
from the text that has not been added to the DAG yet and, for all synsets corresponding to the
word, we find the link distance between the synset corresponding to the current node of the
DAG and the synsets of the selected word. This involves searching the WordNet graph
starting from one of the synsets and proceeding in a breadth first fashion till we reach the
other synset. When we talk about the links here, it must be noted that the reference is to all
kinds of relation links in the WordNet. To decrease the time and space complexity caused by
a breadth first search we restrict the search to a cut-off radius, say search_depth. The
value of this variable decides the depth to which the search proceeds. Any link distance that
is more than the search_depth is taken as infinite. Placing such a restriction on the depth of
the search has the advantage that the precision of disambiguation also increases. The
search_depth thus controls the value of the similarity assigned to pairs of synsets.

4. Form the semantic DAG: After finding the distance to the synsets of a word we compare 
the distances that we have obtained. The word along with its sense, to which a path has been 
found, is added as the child of the DAG node. Two situations arise in this case: 

• The new DAG node is not present. In this case the node is created and added as a child of 
the current DAG node. 

• The new DAG node is already present. This condition can happen when the combination of 
the word and the sense corresponding to the new DAG node had already been found by a 
sibling of the current DAG node. In such a case a pointer is added to the existing DAG 
node making it the child of the current DAG node. 

From the way the DAG is constructed we find that it is built in a breadth first fashion. That
is, the frontier of the DAG is expanded one level at a time. After a particular level has been
added to the DAG, the algorithm goes on to expand the next level. When a DAG node is chosen as the
current node, the word corresponding to the current selection is permanently removed from
the list of words in the text. Therefore a node corresponding to a particular word at a
particular location in the text can occur only once in the DAG. This is to avoid cycles. Note
that this does not prevent the same word, occurring at different points in the text, from
occurring more than once in the DAG at different levels.

5. Pass the score of the parent to the child: Having initialized the scores of the roots to 1,
we need to assign scores to the subsequent nodes at each level. We make a small deviation
from our previous notation and denote a node at level i with sense j as $W_j^i$. The score
of $W_j^i$ depends on the following three parameters:

• Score of parent: Since a DAG node is added based on its proximity to the synset of the
corresponding parent DAG node, the probability of the node depends on the probability
associated with its parent word taking the corresponding sense. Hence, the probability of
$W_j^i$ taking the sense j depends upon the probability of its parent $W_k^{i-1}$ taking the sense k.
Therefore,

$$\mathrm{Score}(W_j^i) \propto \mathrm{Score}(W_k^{i-1})$$

• Distance between the child and the parent: The score of the DAG node also depends on the
distance between the parent word and itself in the text. The intuition behind this is that
similar words are distributed quite close to each other. This phenomenon is evident in texts
having multiple paragraphs. Mostly, the topic of discussion changes during the transition
from one paragraph to another. Similar words can be found inside a paragraph rather than
between paragraphs. Therefore,

$$\mathrm{Score}(W_j^i) \propto \frac{1}{\mathrm{Dist}(W_j^i, W_k^{i-1})}$$

• Link distance in the WordNet: The score of a word also depends on the link distance of the
parent word with the current word in the DAG. The smaller the distance between them, the
more similar the child node is to the parent node. We are then more certain about the sense
of the child node. The score then is

$$\mathrm{Score}(W_j^i) \propto \frac{1}{\mathrm{link\_dist}(W_j^i, W_k^{i-1})}$$

where link_dist is the numerical value of the distance between the parent and child nodes in
terms of the number of links separating them in the WordNet graph. If this distance is larger
than the parameter search_depth then link_dist is taken as infinite.

Hence, when we combine all these measures together into a score we get,

$$\mathrm{Score}(W_j^i) = \frac{\mathrm{Score}(W_k^{i-1})}{\mathrm{Dist}(W_j^i, W_k^{i-1}) \times \mathrm{link\_dist}(W_j^i, W_k^{i-1})}$$

Since the structure is a DAG, a node can have more than one parent. The aggregate score of
the node then becomes the sum total of the scores contributed by each of its parents.


6. Judge the sense for a particular word: Having assigned the scores to all the nodes of the
DAG, we compare the scores of all the senses of a word. We choose the sense of the word
that has the highest score. It must be noted that, since we remove the word from the
text after having added it to the DAG, we will find all the senses of a particular word¹ at the
same level of the DAG. The scores that we have assigned to the different senses of the word
are not normalized, so it can happen that some senses have zero score. This serves a purpose,
as we will discuss later.

The DAG structure produced by the algorithm is nothing but a subgraph
of the WordNet. This subgraph contains most of the words in the text. One property of this
DAG structure is that as we descend the hierarchy of nodes level by level,
the accuracy of disambiguation decreases. This is because of the cascading effect of errors in
the disambiguation process. To start with, we were sure about the sense of the monosems that
occupied the roots, but as the score was passed from one level to another, the certainty about
the proportionality between the score and the sense decreased. This means that if there was
any error in a level and a wrong sense of a word got a high score then this error can pass on to
the descendants of the node and cause errors in them. Hence one technique to improve
precision is to maintain a cut-off at some level of the DAG. All the levels below the cut-off
level can be excluded from the disambiguation procedure. In the experiments given below we
have not used any level cut-offs.

¹ Here, word refers to the word at a particular location of the text and not to a particular word form.
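
The six steps above can be sketched in a few lines of Python. This is one possible reading of
the description and not the implementation used in the experiments. It assumes that the
WordNet graph is given as an adjacency dictionary, that the distance in the text is the
difference between noun positions, and that a link distance of zero (two words sharing a synset)
is treated as 1 to avoid division by zero. The names disambiguate and bounded_link_distance and
the toy synset labels are hypothetical.

```python
from collections import deque


def bounded_link_distance(graph, source, target, search_depth):
    """Breadth first link distance, or None when it exceeds search_depth."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == search_depth:
            continue  # do not expand beyond the cut-off radius
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def disambiguate(words, senses, graph, search_depth=3):
    """Return {position: chosen synset} for the polysemous words that were reached."""
    # Steps 1-2: the monosems form the roots, each with a score of 1.
    scores = {(pos, senses[w][0]): 1.0
              for pos, w in enumerate(words) if len(senses[w]) == 1}
    frontier = list(scores)
    remaining = {pos for pos, w in enumerate(words) if len(senses[w]) > 1}
    while frontier and remaining:
        level = {}
        for parent in frontier:  # Steps 3-4: grow the DAG by one level
            p_pos, p_syn = parent
            for pos in remaining:
                for syn in senses[words[pos]]:
                    link = bounded_link_distance(graph, p_syn, syn, search_depth)
                    if link is None:
                        continue  # treated as infinite distance
                    # Step 5: share contributed by this parent (max(link, 1) is an assumption)
                    share = scores[parent] / (abs(pos - p_pos) * max(link, 1))
                    level[(pos, syn)] = level.get((pos, syn), 0.0) + share
        scores.update(level)
        remaining -= {pos for pos, _ in level}  # words added at this level are removed
        frontier = list(level)
    # Step 6: keep the highest-scoring sense of each polysemous word.
    best = {}
    for (pos, syn), score in scores.items():
        if len(senses[words[pos]]) > 1 and (pos not in best or score > best[pos][1]):
            best[pos] = (syn, score)
    return {pos: syn for pos, (syn, _) in best.items()}


words = ["bank", "money", "loan"]
senses = {"bank": ["bank.river", "bank.finance"], "money": ["money.n"], "loan": ["loan.n"]}
graph = {
    "money.n": ["finance"], "loan.n": ["finance"], "bank.finance": ["finance"],
    "finance": ["money.n", "loan.n", "bank.finance"],
    "bank.river": ["slope"], "slope": ["bank.river"],
}
print(disambiguate(words, senses, graph))  # {0: 'bank.finance'}
```

In this toy run the two monosems money and loan both reach the financial sense of bank at link
distance 2, contributing 1/(1 × 2) and 1/(2 × 2) respectively, while the river sense is never
reached. With search_depth set to 1 no sense of bank is reached and the word is left
undisambiguated.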

6 Algorithm with an Example 

We will demonstrate each step of the algorithm using an example of a text document. We 
use a text from the Semcor corpus. The extract is given below. 

Nevertheless, "we feel that in the future Fulton County should receive some portion of these
available funds", the jurors said. "Failure to do this will continue to place a disproportionate
burden" on Fulton taxpayers. The jury also commented on the Fulton ordinary's court which has
been under fire for its practices in the appointment of appraisers, guardians and administrators
and the awarding of fees and compensation.

In the above extract of the text the underlined words are the nouns in the text and most of 
them have multiple senses (for example, portion has 6 WordNet senses under the noun 
category). 



Figure 2 Partial DAG structure 

(Step 1) Collect the monosems: Some of the monosems in the text under consideration 
would be funds, jurors, guardians, awarding, and taxpayers. 

(Step 2) Initialize the scores of monosems: The words funds, jurors, guardians, awarding 
and taxpayers will form the roots of the DAG and will have a score of 1. 


(Step 3) Find the link distance to other words: From each leaf of the DAG, link distance is 
found to the words in the text. For example, a link distance of 2 is found from the sense no. 1 
of funds to the sense no. 4 of portion. Hence, sense no. 4 of portion is added as a child of 
funds. 

(Step 4) Form the semantic DAG: The above step is repeatedly performed to grow the 
DAG structure. A part of the DAG structure is shown in figure 2. 

In figure 2, monosems occupy the roots of the DAG. The three numbers given in brackets
along with the words represent the position of the word in the text, the sense number and the link
distance to that sense from the parent node respectively.


(Step 5) Pass the score of the parent to the child: The nodes of the DAG are then assigned
scores based on the relation,

$$\mathrm{Score}(W_j^i) = \sum_{W_k^{i-1}} \frac{\mathrm{Score}(W_k^{i-1})}{\mathrm{Dist}(W_j^i, W_k^{i-1}) \times \mathrm{link\_dist}(W_j^i, W_k^{i-1})}$$

where the summation is over all the parents $W_k^{i-1}$ of $W_j^i$. The DAG structure with the scores
assigned to the nodes is shown in figure 3.



Figure 3 DAG structure with scores 

Figure 3 illustrates the final structure of the DAG that we have built. The numbers given
in brackets for each node in the DAG represent the index of the word in the text, the
sense number and the score of that particular node. For example, for the node portion the
index is 1, the sense is 4 and the score is 0.5.

(Step 6) Judge the sense for a particular word: The sense for a word is chosen such that 
the node corresponding to that sense in the DAG has the highest score. 

For example, in figure 3 we find that the word portion at the 1st position of the text has
different scores for its different senses. We choose the sense of portion which has the highest
score. In this case it corresponds to the 4th sense of portion. Although the actual sense of
portion is the 3rd, the 4th sense is found to be close to it. The algorithm is hence found to
narrow down the choice from 6 senses to two in this case. We also find a good example of
disambiguation in the case of court, which has 9 senses.
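
As a check on the scoring relation, the score of 0.5 quoted for sense 4 of portion can be
reproduced from the link distance of 2 found in Step 3. The calculation below assumes, which the
text does not state explicitly, that sense 1 of funds is the only parent of this node and that
funds lies one word position away from portion, so that Dist = 1.

```latex
\mathrm{Score}(\textit{portion}_4)
  = \frac{\mathrm{Score}(\textit{funds}_1)}
         {\mathrm{Dist}(\textit{portion}, \textit{funds}) \times \mathrm{link\_dist}(\textit{portion}_4, \textit{funds}_1)}
  = \frac{1}{1 \times 2}
  = 0.5
```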

7 Evaluation 

The evaluation of this algorithm was done using three different parameters, namely precision²,
recall³ and coverage⁴. For our experiments, we chose documents from the Semcor corpus
(Miller et al., 1993) in which the words have been tagged with sense, part of speech and other
lexical information.

7.1 Disambiguation with a single sense (fine level) 

In our first experiment, we chose the Semcor document br-a01. The reason for this is that
previous works on disambiguation of nouns, like Eneko Agirre's, were also evaluated using
the same document. From this text we extracted the nouns, and these nouns without any
sense tagging were passed to our algorithm as input. The output obtained is the same set of
nouns with the senses. The senses are then compared with the actual senses of these nouns.
The performance is then computed from the counts we obtain. We had earlier discussed that
the scores assigned to the senses are not normalized and hence it can happen that, for a word,
all its senses have zero score. In such a case the word is not disambiguated. We performed
the experiment runs for three different values of the search_depth. The results are
summarized below.


Search Depth   Precision (%)   Recall (%)   Coverage (%)
3              62.87           32.30        51.38
4              57.50           42.46        73.84
5              51.91           45.84        88.30

Table 1 Results of the experiment


2 Precision is defined as the ratio of number of words disambiguated correctly, to the number of words disambiguated in 
total. 

3 Recall is defined as the ratio of number of words correctly disambiguated to the total number of words that were input to 
the algorithm. 

4 Coverage is defined as the ratio of number of disambiguated words to the total number of words that were input to the 
algorithm. 


From the results we find a clear variation in the precision and recall values with the value of
the search_depth parameter: as search_depth increases, precision falls while recall and
coverage rise. The variation of precision can be explained by the fact that
search_depth restricts the distance up to which two synsets can be considered to be similar. If
we increase the value of this parameter, even synsets that are far apart are considered similar,
resulting in a dilution of the similarity measure. This in turn affects the precision of the
disambiguation.
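
The three measures follow directly from three counts, and by their definitions recall equals
precision multiplied by coverage. The sketch below illustrates this; the function name evaluate
and the counts passed to it are hypothetical, while the rows of TABLE_1 are copied from Table 1.

```python
def evaluate(n_input, n_disambiguated, n_correct):
    """Return (precision, recall, coverage) as percentages."""
    precision = 100.0 * n_correct / n_disambiguated if n_disambiguated else 0.0
    recall = 100.0 * n_correct / n_input
    coverage = 100.0 * n_disambiguated / n_input
    return precision, recall, coverage


# Hypothetical counts: 200 input nouns, 100 disambiguated, 60 of them correctly.
print(evaluate(200, 100, 60))  # (60.0, 30.0, 50.0)

# Rows of Table 1: (search depth, precision, recall, coverage).
TABLE_1 = [(3, 62.87, 32.30, 51.38), (4, 57.50, 42.46, 73.84), (5, 51.91, 45.84, 88.30)]
for depth, precision, recall, coverage in TABLE_1:
    # recall = precision x coverage, up to rounding of the published figures
    assert abs(precision * coverage / 100.0 - recall) < 0.05
```

The identity explains why recall can rise with search_depth even though precision falls: the
gain in coverage outweighs the loss in precision.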


7.2 Disambiguation at a coarse level 

It is a known fact that in the WordNet the sense distinction between the words is very fine. 
Due to this, an ordinary human user may not be able to disambiguate a word as correctly as 
had been done by experts in the Semcor documents. To test this, we modified our algorithm 
in such a way as to produce two senses of the word which gave the highest and the second 
highest scores. We mark a word as correctly disambiguated if any of the two senses produced 
matches the actual sense given in the Semcor tagging. 

The results of this experiment appear in the table below.

Search Depth   Precision (%)   Recall (%)   Coverage (%)
3              67.66           34.76        51.38
4              66.66           49.23        73.84
5              66.20           58.46        88.30

Table 2 Results of the second experiment


The results here too follow the pattern of the first experiment, but the values of precision 
and recall are better. 


7.3 Comparison of results

The results of comparison of our experiments with the algorithm using conceptual density
are given in Table 3.

                      Precision (%)   Recall (%)   Coverage (%)
Conceptual density    47.30           39.40        83.20
Coarse evaluation     67.66           34.76        51.38
Fine evaluation       62.87           32.31        51.38

Table 3 Comparison of different algorithms


The comparison is based on the evaluation performed on the br-a01 text of Semcor. We
find from the table that, at a search_depth of 3, our algorithm attains higher precision than
the conceptual density algorithm (62.87 against 47.30 in the fine evaluation), though at lower
recall and coverage. The reason for the higher precision is
that we use the whole text in building the DAG and hence the context in which a word is
judged is not limited to a set of neighboring words. As far as the efficiency is concerned, we
have a trade-off between the efficiency and the recall. As we increase the value of the
search_depth parameter, the efficiency in terms of time and space decreases, while the
recall increases and the precision decreases.


8 Conclusion 

We would like to conclude by noting that the sense disambiguation relies heavily on the 
context of word in the text and must include as much of the whole text as possible. 


The algorithm that has been introduced in the paper attempts to find the senses of words in 
the text by relating them with the words in the entire text rather than some specific set of 
words forming the window of words. 

Another observation that we can make is the variation of the precision and recall values by 
varying the search_depth parameter. This variable gives the user of the algorithm control 
over its performance. 

9 Further Work 

Due to the large number of nouns present in a typical text and to the breadth-first search
processes used for finding the links between synsets, the efficiency of the algorithm is quite poor.
We plan to reduce these overheads by splitting the text into segments and then
disambiguating the segments one at a time.
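To make the cost of the search concrete, the following is a minimal sketch of a breadth-first search bounded by a depth limit. It is not the implementation used in the paper: the graph, the synset identifiers and the function name are hypothetical, and only the idea of a search_depth bound is taken from the text.

```python
from collections import deque


def links_within_depth(graph, start, search_depth):
    """Return {node: distance} for every node reachable from `start`
    in at most `search_depth` edges, using breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == search_depth:
            continue  # do not expand nodes that already sit at the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical toy graph of synset identifiers (not real WordNet data).
toy_graph = {
    "bank#1": ["financial_institution"],
    "financial_institution": ["institution", "bank#1"],
    "institution": ["organization", "financial_institution"],
    "organization": ["group", "institution"],
    "group": ["organization"],
}

for depth in (2, 4):
    reached = links_within_depth(toy_graph, "bank#1", depth)
    print(depth, len(reached))
```

Each increase of the bound lets the search touch more nodes per noun. This is consistent with the observation that a larger search_depth finds more links at a higher cost in time and space.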

Another possible improvement is to modify the scoring to reflect the transition of the topic 
across paragraphs. Normally in a text, there is a shade of transition in the topic across the 
paragraphs. This information can be integrated into the disambiguation algorithm so that the 
score contributed by a DAG node built from a word in one paragraph is not propagated to a 
word in the succeeding paragraphs. 


References 


Cowie J., Guthrie L., and Guthrie J. 1992. Lexical Disambiguation Using Simulated Annealing. Proceedings of
the Fifth International Conference on Computational Linguistics COLING-92. 157-161.

Eneko Agirre and German Rigau. 1995. A Proposal for Word Sense Disambiguation Using Conceptual
Distance. Proceedings of the First International Conference on Recent Advances in NLP. Bulgaria.

Gale W., Church K., and Yarowsky D. 1992. One Sense per Discourse. Proceedings of the DARPA Speech and
Natural Language Workshop. Harriman, New York.

Green, S. 1999. Building Hypertext Links by Computing Semantic Similarity. IEEE Transactions on Knowledge
and Data Engineering, Vol. 11-5, 713-730.

Rada R., Mili H., Bicknell E. and Blettner M. 1989. Development and Application of a Metric on Semantic
Nets. IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, no. 1, 17-30.

Michael Sussna. 1993. Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network.
Proceedings of the Second International Conference on Information and Knowledge Management.
Arlington, Virginia, USA.

Miller G. A. 1990. Five Papers on WordNet. Special Issue of International Journal of Lexicography 3(4).

Miller G. A., Leacock C., Tengi R., and Bunker R. 1993. A Semantic Concordance. Proceedings of the 3rd
DARPA Workshop on Human Language Technology. 303-308. Plainsboro, New Jersey.

Miller G. A., Chodorow M., Landes S., Leacock C. and Thomas R. G. 1994. Using a Semantic Concordance for
Sense Identification. Proceedings of the ARPA Human Language Technology Workshop. 240-243.

Morris J. and Hirst G. 1991. Lexical Cohesion, the Thesaurus, and the Structure of Text. Computational
Linguistics. Vol. 17. No. 1. 211-232.

Ng H. T. and Lee H. B. 1996. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An
Exemplar-Based Approach. Proceedings of the 34th Annual Meeting of the Association of Computational
Linguistics (ACL-96). Santa Cruz.

Resnik P. 1997. Selectional Preference and Sense Disambiguation. Proceedings of ACL Siglex Workshop on
Tagging Text with Lexical Semantics, Why, What and How? Washington.

Yarowsky D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the
33rd Association of Computational Linguistics.


Narayanan Unny E, IIT, Bombay, i. 

Pushpak Bhattacharyya, IIT, Bombay, pb@cse.iitb.ac.in



Validity of Noun Semantic Networks 
For Korean Word-Sense Disambiguation 


Yoo-Jin Moon 
Kyongho Min 


Abstract 

This paper presents a method to verify the validity of the Korean noun semantic networks that
are used for the construction of the selectional restriction relation, by applying the networks to
the syntactic and semantic properties. In addition, this paper utilizes the integrated Korean
noun and verb networks for word-sense disambiguation in Korean sentences, through the
selectional restriction relation in the sentences. Integration of Korean Noun Networks into the
SENKOV system will provide an accurate and efficient knowledge base for the semantic
analysis of Korean NLP.

1 Introduction 

Korean has a large number of polysemous words compared with other languages, because
Chinese characters are read phonetically. Thus WSD (word sense disambiguation) in Korean has been
one of the most popular research themes in NLP and information retrieval. To solve
the problem, semantic networks for verbs and nouns have served as knowledge bases. If
the semantic networks for verbs and nouns are combined effectively, they will play a great
role in solving the WSD problem.

There are several kinds of semantic networks for nouns and verbs: WordNet, Levin verb
classes and VerbNet, German WordNet, and, in Korea, Korean Noun WordNet and SENKOV (Semantic
Networks for Korean Verbs) (Levin(1997), Miller etc.(1993), Moon(1996)).
Their validity has been controversial and difficult to prove.

It has been a difficult task to prove that semantic networks have been built with valid
hierarchical classes and that the semantic networks work properly for the semantic analysis of
sentences. This is because the networks are based on dictionaries, concept, recognition and heuristic
methods (Hemert(1994), Levin(1997), Montemagni and Vanderwende(1992), Moon(1996)).
This paper presents a method to verify the validity of the semantic networks that are used for
the construction of the selectional restriction relation, by applying the networks to the
syntactic and semantic properties. In addition, this paper utilizes the integrated Korean noun
and verb networks for word-sense disambiguation in Korean sentences, through the
selectional restriction relation (subjects, objects and predicates etc.) in the sentences.

2 Literature Review 

2.1 English WordNet and Levin Verb Classes 

WordNet (Miller etc.(1993)) is an on-line lexical reference system whose design is inspired
by current psycholinguistic theories of human lexical memory. English nouns, verbs, adverbs
and adjectives are implemented in terms of synonym sets. Each synonym set represents one
underlying lexical concept. WordNet presently contains about 120,000 word forms. WordNet
may be viewed as a set of semantic networks which represent hypernyms of English word senses
in the form of isa-hierarchies.

Levin verb classes (Levin(1997), Levin and Hovav(1996)) contain various syntactically
relevant and semantically coherent English verb classes. The classification takes a semantic
structure and incorporates the syntactic relationship into the semantic relationship for verbs.
Levin classifies approximately 3,000 verbs into 49 verb classes, and the verb class groups
relate verbs together meaningfully. However, there is little hierarchical organization compared
to the number of classes identified.

2.2 Semantic Networks for Korean Verb 

The SENKOV system (Moon(1999)) classifies about 700 Korean verbs into 46 verb classes by
meaning. It has been implemented on the basis of the definitions in a Korean dictionary, with
top nodes of Levin verb classes, hierarchies of WordNet and heuristics. It attempts to
incorporate syntactic relation into the semantic relation for Korean verbs, and distinguishes
the intransitive verb from the transitive verb. Moon(1997) proves the validity of SENKOV verb
classes by applying them to the selectional restrictions among adverbs and verbs in the
sentence.

2.3 Korean Noun Networks 

Korean Noun Networks are sets of isa hierarchies for Korean nouns (Moon(1996)). The isa
hierarchies consist of nodes and edges. The nodes represent synonym sets of Korean nouns
and English WordNet, and the edges represent hypernymous relations among nodes.

In this paper, Korean Noun Networks are utilized to automatically extract sets of
hypernymous concepts.

2.4 Validity of the Verb Semantic Networks to the Selectional Restriction Relation 

Most of the verb semantic networks have been classified on the basis of the semantic 
properties of the verbs. In order to verify valid hierarchical classification of the networks that 
are used for the selectional restriction relation, Moon(2001) considered the syntactic 
properties as well as the semantic properties of the verbs. That is, the verbs belonging to the 
same verb class should share the syntactic and semantic properties. 

3 Verifying Validity of Noun Semantic Networks 

There are many factors by which to evaluate the validity of noun semantic networks in NLP. One
of the evaluating factors is whether they have valid hierarchical classes for the selectional
restriction relation. The valid hierarchical classes of the noun semantic networks will provide
proper selectional restriction relations that are required for correct semantic analysis of
sentences. In this section, the noun semantic networks are considered from the viewpoint of
the selectional restriction relation in order to utilize the networks for word-sense
disambiguation.

3.1 Validity of English Noun WordNet for Korean 

From the viewpoint of the selectional restriction relation, the validity of English Noun 
WordNet to Korean hierarchical classes is studied. 

For example, the predicate "먹다 (eat)" in the Korean sentence "메리가 (Mary)
사과를 (apple) 먹다 (eat) - Mary eats an apple" takes an object argument, which is a kind
of food. English Noun WordNet contains a synset "food, comestible, edible" which can be
used as the selectional restriction of the predicate "먹다 (eat)". Figure 1
illustrates one of the IS-A hierarchies for "water" in English Noun WordNet.


1st sense of water
water
   => food, comestible, edible
      => substance, matter
         => object, inanimate object
            => entity


Figure 1. IS-A hierarchies of "water" in English WordNet

The predicate "마시다 (drink)" in the Korean sentence "메리가 (Mary) 물을 (water) 마시다
(drink) - Mary drinks water" takes an object argument which is a kind of liquid stuff
such as beverage and drinking water. However, English Noun WordNet doesn't contain a
synset "beverage, drinking water" which is one of the selectional restrictions of the predicate
"마시다 (drink)".

Thus it is not possible for the English Noun WordNet to cover the selectional restriction 
relation in Korean properly because English Noun WordNet does not contain all the valid 
hierarchical classes for Korean WSD. 

3.2 Validity of Korean Noun WordNet for Korean 



1st sense of 물 (water)
물 (water)
   => beverage, drinking water
      => food, comestible, edible
         => substance, matter
            => object, inanimate object
               => entity


Figure 2. IS-A hierarchies of "물 (water)" in Korean WordNet

From the viewpoint of the selectional restriction relation, the validity of Korean Noun 
WordNet to Korean hierarchical classes is studied. 

The predicate "먹다 (eat)" in Korean takes an object argument, which is a kind of food.
Korean Noun WordNet contains a synset "food, comestible, edible" which can be used as the
selectional restriction of the predicate "먹다 (eat)" in Korean. For example, Figure 2 illustrates
one of the IS-A hierarchies for "water" in Korean Noun WordNet.

The predicate "마시다 (drink)" in Korean takes an object argument which is a kind of
liquid stuff such as beverage and drinking water. Korean Noun WordNet contains a synset
"beverage, drinking water" which is one of the selectional restrictions of the predicate
"마시다 (drink)".

By similarly applying Korean Noun WordNet to the selectional restriction relation, we 
know that Korean Noun WordNet contains the valid hierarchical classes to the selectional 
restriction relation in Korean. 
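The check described in Sections 3.1 and 3.2 can be sketched as a membership test on an IS-A chain. The two chains below are copied from Figure 1 and Figure 2; the dictionary of restrictions and the function name are illustrative assumptions rather than part of either network.

```python
# IS-A chains for the first sense of "water", copied from Figure 1 and Figure 2.
ENGLISH_WATER = [
    "water",
    "food, comestible, edible",
    "substance, matter",
    "object, inanimate object",
    "entity",
]
KOREAN_WATER = [
    "water",
    "beverage, drinking water",
    "food, comestible, edible",
    "substance, matter",
    "object, inanimate object",
    "entity",
]

# Synset that each predicate requires of its object, as stated in the text.
OBJECT_RESTRICTION = {
    "eat": "food, comestible, edible",
    "drink": "beverage, drinking water",
}


def satisfies(chain, predicate):
    """True if the synset required by `predicate` occurs in the IS-A chain."""
    return OBJECT_RESTRICTION[predicate] in chain


for name, chain in (("English", ENGLISH_WATER), ("Korean", KOREAN_WATER)):
    print(name, "drink:", satisfies(chain, "drink"), "eat:", satisfies(chain, "eat"))
```

With these chains, the restriction for "drink" is found only in the Korean hierarchy, which is the difference the two subsections point out.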

4 Integration of Semantic Networks For Word-Sense Disambiguation

Integration of semantic networks will contribute to semantic analysis in NLP and
speech recognition. This paper addresses word-sense disambiguation in machine
translation by integrating noun semantic networks into verb semantic networks for
Korean. Figure 3 illustrates a part of the Database for Integration of Semantic Networks (DISNet).




9.1 걸다 (hang, stake, run)
    [POS] (transitive verb)
    [SYN] (S+V+O+L)
    [SUBCAT]:
      [S - nc 1.2.1 (person, individual, human)
           nc 1.2.2 (animal, animate being, brute)
       V - hang, stake, run
       O - nc 1.3 (object, inanimate object, thing)
               (Eng. hang + nc 1.3 + prep. + L)
         - 목숨 (life) (Eng. run + a risk)
         - nc 7.5.3 (money and other possessions, medium of exchange)
               (Eng. stake + nc 7.5.3)
         - nc 2.3.2.8.11 (telephone, telephony) (Eng. call)
       L - nc 5.6 (location) ]

9.1 : SENKOV verb class 9.1
POS : part of speech
SYN : syntactic structure
SUBCAT : subcategorization information
nc : hierarchical class of Korean Noun Networks
S: subject, V: verb, O: object, L: location
Eng. : English
*) Values of SYN and SUBCAT are collected from corpus.

Figure 3. A Part of DISNet

Figure 3 describes a part of DISNet. SENKOV verb class 9.1 contains the verb "걸다 (hang,
stake, run)". The verb has three slots with the following values:

POS (part of speech) : a transitive verb

SYN (syntactic structure) : S+V+O+L

SUBCAT (subcategorization) : [S - nc 1.2.1 (person, individual, human) ...],

where the values of SUBCAT are integrated with the hierarchical classes
of Korean Noun Networks.

That is, the selectional restriction of the subject for the verb is 'person, individual,
human' (noun class 1.2.1) or 'animal, animate being, brute' (noun class 1.2.2), and that of the
object is 'life' or 'object, inanimate object, thing' (noun class 1.3) etc. Values of SUBCAT are
collected from corpus (Shin(1999), Roland(2000), Gonzalo and Verdejo(2000)) and mapped
to Korean Noun Networks.

For example, the verb "걸었다" (the past form of "걸다 (hang, stake, run)") in the Korean
sentence "주인이 벽에 그림을 걸었다" might be translated into the English word "hung"
rather than either "staked" or "ran".

주인이 (S: owner) 벽에 (L: on the wall) 그림을 (O: picture) 걸었다 (V: ?).

The predicate of the sentence can be translated into one of three English verbs — hang, 
stake, run. The object of the sentence is "picture" which belongs to the noun class 1.3. 
According to DISNet in Figure 3, the predicate of the sentence may be translated into the 
English verb "hang." Thus the above sentence is translated into "The owner hung a picture on 
the wall." 

However, the verb "걸었다" in the following Korean sentence might be translated into
"stake".

할머니는 (S: grandmother) 짝수에 (L: on the evens) 오만원을 (O: 50,000 won) 걸었다 (V: ?).

The predicate of the sentence can be translated into one of three English verbs: hang,
stake, run. The object of the sentence is "50,000 won" 1 which belongs to the noun class 7.5.3.
According to DISNet in Figure 3, the predicate of the sentence may be translated into the
English verb "stake." The above sentence is translated into "A grandmother staked 50,000
won on the evens."
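The two translation examples amount to a lookup from the noun class of the object to an English verb. The sketch below encodes the class-valued object entries of Figure 3 and picks the most specific matching class. It assumes that the dotted class numbers encode the hierarchy (for example, that a class 1.3.4 would lie below 1.3), which the figure suggests but does not state. The data layout and function names are hypothetical, and the lexical entry for 'life' is left out because it is a word rather than a class.

```python
# Object-slot entries of the DISNet record in Figure 3: (noun class, English verb).
OBJECT_TO_ENGLISH = [
    ("1.3", "hang"),          # object, inanimate object, thing
    ("7.5.3", "stake"),       # money and other possessions, medium of exchange
    ("2.3.2.8.11", "call"),   # telephone, telephony
]


def is_subclass(noun_class, ancestor):
    """True if `noun_class` equals `ancestor` or lies below it.

    Assumes the dotted numbering encodes the hierarchy.
    """
    return noun_class == ancestor or noun_class.startswith(ancestor + ".")


def translate_verb(object_class):
    """Return the English verb whose object restriction covers `object_class`.

    The most specific (deepest) matching class wins; None if nothing matches.
    """
    matches = [
        (cls, verb) for cls, verb in OBJECT_TO_ENGLISH if is_subclass(object_class, cls)
    ]
    if not matches:
        return None
    deepest = max(matches, key=lambda match: len(match[0].split(".")))
    return deepest[1]


print(translate_verb("1.3"))    # object class of "picture" in the first example
print(translate_verb("7.5.3"))  # object class of "50,000 won" in the second example
```

A list of pairs is used instead of a flat dictionary so that further restrictions per verb can be added without changing the lookup.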

DISNet provides an accurate and efficient knowledge base for word-sense disambiguation.
DISNet can also play an important role in both computational linguistic applications and
psycholinguistic models of language processing.

Applicable areas of DISNet are disambiguation of nouns and verbs for NLP and machine
translation, writing aids, speech recognition, conversation understanding, abridged
sentences, human-computer interfaces, and extraction of co-occurrence information and
structure information in information retrieval and summarization.

5 Conclusions 

It has been a difficult task to prove that semantic networks have been built with valid
hierarchical classes and that the networks work properly for the semantic analysis of sentences.
This is because the networks are based on dictionaries, concept, recognition and heuristic methods.
This paper presented a method to verify the validity of Korean noun semantic networks by
applying the networks to the syntactic and semantic properties.

1 Unit of Korean currency.

In addition, this paper utilizes the integrated Korean noun and verb networks for word-sense
disambiguation in Korean sentences, through the selectional restriction relation in the
sentences.

The presented method can be utilized to prove the validity of semantic networks for
other languages. Integration of Korean Noun Networks into the SENKOV system will
provide an accurate and efficient knowledge base for the semantic analysis of NLP.

Future work is to extend DISNet to all of the Korean verbs and to apply it to NLP.


6 References 

Hemert P. (1994) "KASSYS : A Definition Acquisition System in Natural Language," Proc. of COLING-94, pp.
263-267.

Levin B. (1997) English Verb Classes and Alternations : A Preliminary Investigation, The MIT Press.

Levin B. and Hovav M. (1996) Unaccusativity: At the Syntax-Lexical Semantics Interface, The MIT Press.

Miller G. A., Beckwith R., Fellbaum C., Gross D. and Miller K. (1993) "Introduction to WordNet: An On-line
Lexical Database," in Five Papers on WordNet, CSL Report, Cognitive Science Laboratory, Princeton
University.

Montemagni S. and Vanderwende L. (1992) "Structural Patterns vs. String Patterns for Extracting Semantic
Information from Dictionaries," Proc. of COLING-92, pp. 546-552.

Moon Y. (1996) Design and Implementation of Korean Noun WordNet Based on the Semantic Word Concept, Ph.
D. Thesis, Dept. of Computer Engr., Seoul National Univ.

Moon Y. (1999) "Design and Implementation of SENKOV System and Its Application to the Selectional
Restriction", Proc. of the Workshop MAL in NLPRS, pp. 81-84.

Moon Y. (2001) "Construction of Semantic Networks for the Language Information Processing", Proceedings of
International Symposium on Advanced Intelligent Systems, pp. 42-46.

Nyberg E. etc. (1998) The KANT Translation System: From R&D to Large-Scale Deployment, LISA Newsletter,
vol. 2:1.

Shin Joong-Ho etc. (1999) "Verb Classification Using the Clustering Technique", Proc. of Korean Cognitive
Science Society.

Roland D. (2000) "Verb Subcategorization Frequency Differences between Business-News and Balanced
Corpora : The Role of Verb Sense", Proc. of the Workshop on Comparing Corpora in ACL-2000.

Gonzalo J., Chugur I. and Verdejo F. (2000) "Sense Clusters for Information Retrieval: Evidence from SemCor
and the EuroWordNet Interlingual Index", Proc. of SIGLEX Workshop on Word Senses and Multi-linguality
in ACL-2000.

Yoo-Jin Moon, Hannam University, Daejun, Korea, vimoon@mail.hannam.ac.kr

Kyongho Min, Auckland University of Technology, Auckland, New Zealand, Kyongho.Min@aut.ac.nz



ItalWordNet in an Annotation Task: A Chance for Discussion 

Claudia Soria 
Francesca Bertagna 
Nicoletta Calzolari 

Abstract 

In this paper we suggest how the SENSEVAL exercise can be profitably used for evaluating the lexical 
reference resources used for annotation. Some general reflections about the adequacy of the 
ItalWordNet database as a reference resource for an annotation task are proposed, focusing in 
particular on the lessons learned from the SENSEVAL-2 experiment. 

Introduction 

There are no well-established methods for evaluating lexical resources. We think that an exercise like
SENSEVAL can be profitably used to partially remedy this situation. SENSEVAL is intended to
provide a framework for evaluation of different word sense disambiguation systems. We claim,
however, that the SENSEVAL experience can be exploited not only for its direct purpose, i.e. the
evaluation of WSD systems, but also as an interesting testbed for evaluation of lexical resources in a specific
annotation task. Indeed, SENSEVAL provides us with an ideal observational scenario for the analysis
and comparison of human and automatic annotation.

Our main concern here is to put forth evaluative observations concerning the use of a wordnet-like
database as a reference resource for manual and automatic semantic annotation.

In the first section we give a short description of the type of information available in the ItalWordNet
lexicon, which was the lexicon used for the SENSEVAL sense disambiguation task for Italian. We also
give a brief overview of the annotated corpus prepared for SENSEVAL-2, covering the
lemmas chosen, the number of occurrences in the corpus, etc. We then compare the annotation
experiences of SENSEVAL-1 and SENSEVAL-2. For the Italian task in SENSEVAL-1 a traditional printed
lexicon was used (see Calzolari and Corazzari, 2000), while in the second edition of the experiment a
new kind of sense inventory was adopted as a reference resource for annotation, namely the
ItalWordNet computational lexicon. It is evident why such a comparison is extremely interesting: a
computational lexicon was precisely the kind of resource that was indicated as a solution to the range
of problems (vagueness, circularity and/or inconsistency of definitions) raised by traditional
lexicographic resources.

We will also give a preliminary analysis of the impact that sense clustering can have on a 
disambiguation task. 

1. The WSD task in Senseval-2 

The main goal of the SENSEVAL initiative is to set up a framework for evaluation of WSD systems 
through a comparative analysis of their performance on similar data for different languages. Two 
aspects of the experiment are particularly relevant to our purposes, namely the preparation of the 
reference annotated corpus and the evaluation of systems’ performance against it. In both aspects we 
are mainly interested in analysing the results for what they can tell us about the issue of semantic 
annotation of a corpus, either manual or automatic, and the use of a computational lexicon for 
annotation. 

The first edition of the experiment (Kilgarriff and Palmer, 2000) was held in 1998. SENSEVAL-2
(Toulouse, 2001) was organized following the very same principles as SENSEVAL-1, the main
difference being that the majority of the tasks used computational lexicons and in particular three of
them used a wordnet-like database. In the following two sections we describe in more detail the
corpus and the lexicon used for the Italian task 1 .

1.1. Corpus and Lexicon 

The corpus and dictionary used for the Italian WSD task were taken from the resources
independently developed in the framework of the SI-TAL project 2 . Unlike in most other
language tasks, the data had not been adapted for use in the SENSEVAL task, apart from the
necessary format conversions.

1.1.1. The Corpus 

The Italian SENSEVAL corpus consisted of about 3900 instances for 83 lexical entries (46 nouns, 21 
verbs, and 16 adjectives), with an average of 47 contexts per entry. 

The lemmas included in the SENSEVAL-2 corpus were selected on the basis of the following criteria: 

• polysemy in the reference lexicon; 

• polysemy attested in the corpus; 

• frequency. 

The average polysemy was 5 senses per word (5 for the nouns subset, 6 for the verbs and 3 for the
adjectives).

The occurrences had been extracted from the SI-TAL Italian Syntactic-Semantic Treebank (ISST 3 ), 
consisting of two sub-components: a generic and a domain-specific (financial) corpus, of about 
215,000 and 90,000 tokens, respectively. The annotated material comprises instances of newspaper 
articles, representing everyday journalistic Italian language. 

Semantic annotation of ISST was performed manually using the ItalWordNet lexicon (henceforth
IWN, see Roventini et al. 2000) as a reference resource (see Section 1.1.2). Semantic annotation
consisted in assigning to each full word or sequence of words corresponding to a single unit of sense
(such as compounds, idioms, etc.) a given sense number (referring to a specific synset) taken from
IWN, plus specific features created for the annotation task to account for idioms, compounds and
multi-words, figurative uses, evaluative suffixation, foreign words, proper nouns and titles, among
others.

1.1.2. The Reference Lexicon 

The occurrences provided for the Italian WSD task were annotated according to the lexical-semantic 
database ItalWordNet, also developed within the framework of the SI-TAL Project 4 . 

ItalWordNet is an extension of the Italian wordnet built during the EuroWordNet project (Vossen,
1999).


1 An in-depth description of the task and results of Senseval-2 is not the aim of the present paper; the interested reader is
referred to (Bertagna, Soria and Calzolari 2001).

2 SI-TAL ('Integrated System for the Automatic Treatment of Language') is a National Project involving several research
centers in Italy. Its aim is the development of large linguistic resources and software tools for Italian written and spoken
language processing.

3 See Montemagni et al. (2000a) and Montemagni et al. (2000b). ISST is a joint effort among the Consorzio Pisa Ricerche
(Pisa, Italy), Certia (Rome, Italy), Consorzio Venezia Ricerche (Venice, Italy) and IRST (Istituto per la Ricerca Scientifica e
Tecnologica), Trento, Italy.

4 ItalWordNet is a joint effort between the Consorzio Pisa Ricerche (Pisa, Italy) and IRST (Istituto per la Ricerca Scientifica
e Tecnologica), Trento, Italy.



The IWN database is composed of: 

i) a generic wordnet containing about 64,000 word senses corresponding to about 49,000 synsets; 

ii) a (generic) Interlingual-Index (ILI) which is an unstructured version of WordNet 1.5, also used in 
EWN to link wordnets of different languages; 

iii) a terminological wordnet, containing about 5,000 synsets of the economic-financial domain; 

iv) a terminological ILI, to which the terminological wordnet is linked; 

v) the Top Ontology, a hierarchy of language-independent concepts, built within EWN and partially 
modified in IWN to account for adjectives (Alonge et al., 2000). Via the ILIs, all the concepts in 
the generic and specific wordnets are directly or indirectly linked to the Top Ontology; 

vi) the Domain Ontology, containing a set of domain labels. Via the ILIs, all the concepts in the 
generic and specific wordnets are directly or indirectly linked to the Domain Ontology. 

2. A Comparison of Senseval-1 and Senseval-2 

The main difference between SENSEVAL-1 and SENSEVAL-2, at least as far as the organization of the
Italian WSD task is concerned, is that the two editions of the competition used different types of
lexical resources. In SENSEVAL-1, the Italian corpus was annotated according to a
traditional printed dictionary (see Calzolari and Corazzari 2000, p. 62).

We are now provided with the opportunity to compare the SENSEVAL-1 and SENSEVAL-2 annotation 
experiences from the point of view of the impact of a different type of lexical resource on the 
annotation task. It is worth noting that an outcome of the SENSEVAL-1 evaluation was exactly the 
recommendation that a computational lexicon be used. 

In the remainder of this paper we give an overview of manual and automatic annotation in SENSEVAL-2,
with particular emphasis on whether there have been significant qualitative and/or
quantitative differences or improvements in the task of tag assignment. For a detailed description of
the Italian SENSEVAL-1 task and results, see Calzolari and Corazzari (2000).

2.1. Manual Annotation 

Manual annotation of the corpus was performed independently by two different annotators: the first 
annotated the data during the phase of ISST building, while the second performed the annotation 
when the SENSEVAL-2 subset was extracted. Annotators used two different tools in parallel: a tool 
especially tailored to semantic annotation, that allowed the display of the corpus sentences containing 
a given lemma, and the tool for the browsing of the data in the semantic net. 

The annotator could thus annotate the occurrences of a lemma in the corpus on the basis of the IWN 
sense inventory available for that lemma which was displayed in a separate window. 

The two versions of manual annotation provide the basis for calculating agreement rates and for 
highlighting problematic cases. 

The following table shows the results in terms of full agreement for each part-of-speech: 



                     Nouns    Verbs    Adjectives
Total occurrences    2222     889      773
Full agreement       2102     802      675
%                    94,6     90,2     87,3


Table 1: Full agreement rate for each PoS 
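The rates in Table 1 can be recomputed from the counts. The sketch below reads the first row of the table as total occurrences and the second as occurrences on which the two annotators fully agree, and takes the third column to be adjectives; this reading of the layout is an interpretation. The function name is illustrative.

```python
# (total occurrences, occurrences with full agreement) per part of speech, from Table 1.
counts = {
    "nouns": (2222, 2102),
    "verbs": (889, 802),
    "adjectives": (773, 675),
}


def agreement_rate(total, agreed):
    """Percentage of occurrences on which both annotators chose the same sense."""
    return 100.0 * agreed / total


for pos, (total, agreed) in counts.items():
    print(f"{pos}: {agreement_rate(total, agreed):.1f}%")
# prints 94.6% for nouns, 90.2% for verbs and 87.3% for adjectives
```

The recomputed percentages match the rates given in the table, which supports the reading of the rows used here.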


The results show an overall high agreement between human annotators with a decrease from nouns to 
adjectives. The pattern is consistent with the results of Calzolari and Corazzari (2000) and those of 
Fellbaum et al. (2001). 


Another interesting issue is to verify whether any correlation can be established between disagreement 
rate and polysemy of the lemmas. We thus considered, for each part-of-speech, those lemmas that 
proved to be more problematic for annotators, and related them to their polysemy, expressed as the 
number of senses attested in the lexicon. In order to be more accurate, however, we also related the 
disagreement rate to the actual polysemy of lemmas, namely the number of senses actually attested in 
the corpus. Tables 2, 3, and 4 show the ten most problematic lemmas for each part of speech.


Lemma                      Pol.           Act. Pol.    Disagr.
Colpire (to hit)           [illegible]    4            44,2%
[illegible]                3              2            25,9%
Coprire (to cover)         14             7            25%
Lasciare (to leave)        9              6            22,2%
Scoprire (to discover)     8              6            20,7%
Capire (to understand)     5              3            12,2%
Entrare (to go in)         5              5            11,8%
Trovare (to find)          8              7            11,1%
Vedere (to see)            17             7            8,8%
Dichiarare (to declare)    6              4            5,8%


Table 2: The ten most disagreed verbs. 


Lemma                  Pol.           Act. Pol.    Disagr.
Posto (place)          3              3            25%
Ora (hour)             3              3            23,1%
Colpo (blow)           [illegible]    5            20%
Senso (sense)          [illegible]    4            18,7%
Controllo (control)    6              5            18,2%
Rischio (risk)         2              2            16,7%
Forza (strength)       3              2            14,3%
Politico (politics)    4              3            14,3%
Mondo (world)          9              3            12,7%
Opera (work)           8              5            11,3%


Table 3: The ten most disagreed nouns


Lemma                   Pol.           Act. Pol.    Disagr.
Solo (alone)            3              3            30,6%
Nuovo (new)             3              3            24,1%
Lungo (long)            [illegible]    2            20%
Pronto (ready)          [illegible]    4            18%
Possibile (possible)    2              2            17%
Generale (general)      3              2            12%
Grande (big)            6              5            12%
Vero (true)             3              3            11,1%
Bello (beautiful)       [illegible]    4            8,7%
Buono (good)            6              6            5,4%


Table 4: The ten most disagreed adjectives 


The results highlight no apparent correlation between the polysemy of a lemma (be it actual or
general) and the disagreement rate. Indeed, highly polysemous words such as the verb vedere (to see),
or the noun linea (line), with 17 and 11 senses, respectively, scored a low disagreement rate (8,7%
and 8,3%). On the other hand, the lemmas that appeared more difficult to disambiguate were in fact
those with very few senses, such as colpire (to hit, 4 senses, 44,2%), posto (place, 3 senses, 25%), or
the adjective solo (alone, 3 senses, 33,3%).
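One way to probe the claim numerically is to compute a correlation coefficient between polysemy and disagreement. The sketch below does this for the verb rows of Table 2 whose three values are all legible in this copy. It is only an illustration: the rows are the ten most disagreed verbs, not a random sample, and the choice of Pearson's coefficient is an assumption rather than the authors' method.

```python
from math import sqrt

# (polysemy in the lexicon, polysemy attested in the corpus, disagreement in %)
# for the rows of Table 2 in which all three values are legible.
rows = [
    (3, 2, 25.9),
    (14, 7, 25.0),
    (9, 6, 22.2),
    (8, 6, 20.7),
    (5, 3, 12.2),
    (5, 5, 11.8),
    (8, 7, 11.1),
    (17, 7, 8.8),
    (6, 4, 5.8),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


polysemy = [row[0] for row in rows]
actual_polysemy = [row[1] for row in rows]
disagreement = [row[2] for row in rows]

print("lexicon polysemy vs disagreement:", round(pearson(polysemy, disagreement), 2))
print("actual polysemy vs disagreement:", round(pearson(actual_polysemy, disagreement), 2))
```

Coefficients near zero would be in line with the observation above. A firmer conclusion would need all annotated lemmas, not just the most problematic ones.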

2.1.1. Some Comparative Observations on Manual Annotation 

A comparison of the results obtained in SENSEVAL-2 with those of SENSEVAL-1 highlights at least
two issues.

The first is the fact that SENSEVAL-2 results display a considerable quantitative overall improvement 
in terms of agreement rate, as Table 5 illustrates. 



              Senseval-1    Senseval-2
Nouns         85,3          94,6
Verbs         79,4          90,2
Adjectives    62            87,3


Table 5: Agreement rate in Senseval-1 and Senseval-2


On the qualitative side, on the other hand, an evaluation of the correlation between disagreement rate 
and polysemy of the lemmas highlights a different pattern. 

Calzolari and Corazzari (2000) report partially similar findings, observing how no apparent 
correlation could be established between polysemy of lemmas and disagreement rate. However, they 
do find a relationship between annotators’ disagreement rate and actual polysemy, concluding that 
actual polysemy, namely polysemy attested in the corpus, seems more important than the potential 
degree of polysemy as it is attested in the reference lexicon. 


This latter finding is not supported by our data, where no correlation at all can be established between
disagreement rate and polysemy.

We are thus required to explain why the quantitative improvement in the overall agreement rate
cannot be accounted for by lower polysemy of the lemmas analysed.

More importantly, it seems that the reason for low disagreement between annotators must be sought
elsewhere than in the polysemy of the lemmas.

Calzolari and Corazzari (2000) convincingly argued that disagreement between annotators was mainly
due to the characteristics of the dictionary, in particular the ambiguity of dictionary readings,
and especially the vagueness and excessive granularity of sense distinctions.

A closer analysis of the SENSEVAL-2 manually annotated data allows us to conclude that exactly the
same range of dictionary problems must be invoked to explain the lack of agreement between
annotators.

Before trying to account for these findings, we first illustrate the results of automatic
annotation in Senseval-2.

2.2. Automatic Annotation 

In Senseval-2, two systems participated in the evaluation for the Italian task and the quantitative 
evaluation of their performance is given in the Senseval-2 proceedings (Edmonds and Kilgarriff 
(eds.), forthcoming). 

We provide here only a few observations concerning linguistic aspects related to their
performance. First, however, a few technical details are in order to clarify some aspects of the
experiment.

The participating systems must assign an answer (i.e., a sense id) to each occurrence of a
given lemma in the corpus. Three scoring policies are adopted in SENSEVAL-2. Fine-grained
scoring implies a one-to-one mapping between the gold-standard tags and the guess.
Coarse-grained scoring presupposes the availability of a sense subsumption table:
both the system's answers and the gold-standard tags are mapped onto coarse-grained senses,
and then a comparison between them is performed. Mixed-grained scoring applies if a sense
subsumption hierarchy is available: it gives some credit, but not full credit,
to choosing a more coarse-grained sense than the gold-standard tag.
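
The three scoring policies can be illustrated with a minimal Python sketch. The sense ids, the subsumption table and the partial-credit value of 0.5 for mixed-grained scoring are assumptions made for illustration only; the description above states merely that mixed-grained scoring gives some, but not full, credit.

```python
# Hypothetical sense subsumption table: fine-grained sense id -> coarse-grained sense id.
COARSE = {
    "anno#1": "anno#A",
    "anno#2": "anno#A",
    "anno#3": "anno#B",
    "anno#4": "anno#C",
}


def fine_grained(gold, guess):
    """Full credit only when the guess is exactly the gold-standard tag."""
    return 1.0 if guess == gold else 0.0


def coarse_grained(gold, guess, table=COARSE):
    """Map both tags onto coarse-grained senses, then compare them."""
    return 1.0 if table.get(gold, gold) == table.get(guess, guess) else 0.0


def mixed_grained(gold, guess, table=COARSE, partial=0.5):
    """Exact match scores 1; guessing the coarse sense that subsumes the
    gold tag scores `partial` (an assumed value); anything else scores 0."""
    if guess == gold:
        return 1.0
    if table.get(gold) == guess:
        return partial
    return 0.0


def score(policy, gold_tags, guesses):
    """Average credit over all occurrences of a lemma."""
    credits = [policy(g, a) for g, a in zip(gold_tags, guesses)]
    return sum(credits) / len(credits)


gold_tags = ["anno#2", "anno#2", "anno#3"]
guesses = ["anno#1", "anno#2", "anno#3"]
fine_score = score(fine_grained, gold_tags, guesses)
coarse_score = score(coarse_grained, gold_tags, guesses)
```

In this toy run the first guess is wrong under fine-grained scoring but counts as correct once both tags are mapped onto the same coarse-grained sense.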

For the analysis of the data we used the same method employed for the analysis of manual
annotation: we considered a failure of the system as an instance of disagreement, and then
we verified whether any correlation between the most difficult lemmas and polysemy
could be established. As for manual annotation, both potential polysemy and actual polysemy were
considered.

As far as automatic annotation is concerned, a preliminary analysis seems to confirm
the tendency found in the distribution of disagreement over the lemmas for manual annotation.
There seems to be no correlation between the degree of polysemy (both actual and potential)
and disagreement rate. For instance, a highly polysemous word such as the verb vedere
(to see, 7 senses attested out of 17) or the noun lavoro (work, 6 senses attested out of 7)
had a relatively low disagreement rate (47% and 52%, respectively). On the other hand, words
such as anno or fine (year and end, both 3 senses attested out of 4 lexicon senses) scored
disagreement rates of 98% and 91%, respectively 5.


The analysis of automatic annotation would then confirm the pattern found for manual annotation, 
namely that a failure in sense disambiguation cannot entirely be due to the number of senses of a 
lemma. 
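
One way to test for such a relationship is a rank correlation between the number of senses and the failure rate. The sketch below is purely illustrative: it uses only the four lemmas quoted above, which are too few and too hand-picked to support any conclusion, and the choice of Spearman's coefficient is an assumption, since the text does not say which measure was applied.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# (attested senses, lexicon senses, failure rate in %), as quoted in the text.
LEMMAS = {
    "vedere": (7, 17, 47.0),
    "lavoro": (6, 7, 52.0),
    "anno": (3, 4, 98.0),
    "fine": (3, 4, 91.0),
}

actual = [v[0] for v in LEMMAS.values()]
potential = [v[1] for v in LEMMAS.values()]
failure = [v[2] for v in LEMMAS.values()]

rho_actual = spearman(actual, failure)
rho_potential = spearman(potential, failure)
```

If polysemy were driving the failures, both coefficients would be expected to be clearly positive over the full set of lemmas.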

2.2.1. Some Comparative Observations On Automatic Annotation 

The results of the analysis of automatic annotation largely overlap with those reported by
Calzolari and Corazzari (2000) for SENSEVAL-1 6.

In the discussion of manual annotation, we indicated that the reason for disagreement should be
sought elsewhere than in the number of senses, suggesting instead that the way the different senses are
distinguished could be a more relevant factor.

In SENSEVAL-2, coarse-grained scoring gave us the opportunity to evaluate the impact
of sense distinction on sense disambiguation. We thus isolated some lemmas whose senses were
poorly distinguishable in the lexicon, and evaluated the impact of sense clustering on WSD
systems.

The results for some of the verbs and nouns are shown in the chart on the last page.

3. General Remarks 

A few useful observations can be made from the analysis of manual and automatic annotation in 
SENSEVAL-1 and SENSEVAL-2. 

3.1. Quantitative Improvement 

Manual annotation performed substantially better than in the previous edition of the experiment (cf. 
section 2.1.1). 

The improvement can be explained in terms of: 

1. better coverage of the IWN resource; 

2. method of creation of the resource. 

IWN, unlike a traditional lexicographic resource, has been built taking into consideration the 
annotators’ needs and feedback. 

This is due to the fact that the IWN database was still under construction while the ISST was being 
built. 

A protocol to regulate the interaction between the IWN codifiers and the ISST annotators was 
established, and it has been possible to create or to adjust a sense (or a lemma) when it was needed by 
the annotation task. 


5 In any case, we must consider the fact that the low performance of the two systems for the Italian WSD task prevents us
from drawing anything more than tentative conclusions.



Therefore, while a traditional lexicographic resource is usually created independently from an 
annotation task, the construction phase of IWN benefited, at least partially, from the information 
derived from the annotation of the ISST. 

3.2. Analysis of the Disagreement Rate 

As far as disagreement is concerned, the reason for both manual and automatic disagreement seems to
lie more in the way the different senses are distinguished than in the number of senses available in the
resource.

From a qualitative point of view, the evaluation of the specific cases highlights the persistence of the
same problems encountered using a traditional printed dictionary.

The disagreement between annotators seems mainly connected to the vagueness,
ambiguity and granularity of the sense distinctions.

Some senses that would seem easily distinguishable according to the sense distinctions present in the
dictionaries turn out to be hardly applicable in most corpus occurrences.

A typical example is given by the adjective possibile (possible): 

1. possibile -- che può esistere, che può essere fatto (that can exist, that can be done)

2. possibile, probabile, verosimile -- detto di ciò che è verosimile e può pertanto realizzarsi; che è
simile al vero e come tale può essere creduto (that is likely and that might be)

This distinction follows the one present in an Italian dictionary (Garzanti). The first sense refers to a
more “practical” nuance of possibility and is linked to the {realizzabile, attuabile, effettuabile,
praticabile, eseguibile} (feasible, etc.) synset. The second sense refers to a more “philosophical”
nuance of the concept of possibility in terms of likelihood.

The adjective possibile turned out to be one of the hardest to disambiguate and one with the highest 
disagreement rate. 

In what follows we exemplify some contexts of occurrence of the word: 

Dall'incrocio dei dati in possesso delle Camere di commercio con quelli in possesso di
grandi archivi telematici, come Inps, Enel, Inail, sarà infatti <head>possibile</head> in
futuro ottenere il controllo in tempo reale dell'intero sistema delle imprese italiane che
attualmente conta quasi 4 milioni di posizioni. (sense 1) 7

Forse non è il <head>possibile</head> arrivo di Schumacher a demoralizzare i piloti ma la
china discendente che la Ferrari sembra aver imboccato. (sense 2) 8

E' necessario infatti stimolare tutti i <head>possibili</head> recuperi di produttività e
comportamenti più concorrenziali per riflettere sui prezzi le diminuzioni dei costi e per
contenere la spesa pubblica. (sense 1 or 2?) 9

This distinction is probably too subtle and difficult to use in the corpus and there is a “grey zone” in 
which the two senses overlap. However, this happens much more often than one could expect. 

7 (...) in the future it will be possible to control the whole system of the Italian companies (...) 

8 What discourages the drivers is not Schumacher’s possible arrival, but (...)

9 It is necessary to stimulate all the possible recovery in productivity (...) 



3.3. Coarse-Grained Results 

Sense clustering is significantly related to an improvement in the performance of automatic
annotation.

The coarse-grained results show that a serious rethinking of the sense distinctions is needed in order to
achieve better clarity and simplicity.

The effects of clustering are most apparent in the case of the lemma anno (year), whose sense
distinctions are as follows:

1. anno -- tempo necessario alla Terra per compiere il suo giro intorno al Sole (the time taken by
the Earth to turn around the Sun)

2. anno -- periodo di dodici mesi in genere (a generic period of twelve months)

3. anno -- periodo di tempo non determinato, di cui si sottolinea la lunghezza (an undetermined
period of time, usually very long)

4. annata, anno -- periodo di tempo (sp. nell'agricoltura); arco di tempo in cui si svolge un ciclo di
attività (anno accademico, anno liturgico) (a period of time, e.g. in agriculture; the span of an
activity cycle)


After senses 1 and 2 had been merged, one of the systems improved its performance, going from 103
to 34 negative answers (cf. Chart 1).

The same distinction between senses 1 and 2 can be found in the Italian printed
dictionary and also in WordNet 1.5. It separates an astronomical-scientific
sense of year from a more general, everyday sense as a “time measure”.

This distinction was not problematic for the human annotators (who always used sense 2,
with the exception of the occurrences of anno solare, solar year), while it turned out to be very
problematic for the automatic systems, which applied sense 1 only.

The case of the lemma “anno” raises another issue concerning the possibility, in the WordNet
model, of discriminating among the different senses: all four senses of “anno” are related
to the same hypernym ({tempo, periodo}) (time, period) and to the same Top Concepts (Time and
Quantity).

It is difficult to conceive how an automatic system could distinguish among the different senses 
without at least a parallel hierarchical distinction. 

While human disambiguation can be done even on the basis of the mere definitions, 
computational resources are useful only to the extent they provide multi-dimensional 
information; the model should provide the highest expressiveness in terms of sense-discriminating 
power. 

Conclusions 

Starting from the SENSEVAL-2 experience, we would like to make a few general remarks, both about 
the adequacy of available lexical-semantic reference resources for WSD tasks and about the overall 
task of lexical-semantic annotation. 

In recent years, many researchers have noted that it is misleading to reproduce in lexical
resources what Fillmore calls the “checklist theories of meaning” (Fillmore, 1975).



Kilgarriff (1997) and Hanks (2000), quoting Sue Atkins’ well-known sentence “I don’t
believe in word senses”, expressed their skepticism about the possibility of capturing
through sense enumeration the overlaps, vagueness, and interplay of the different uses of a
word in a language. Yet these uses are exactly what contributes to giving language its
extraordinary dynamism and expressive power.

The information available in the ItalWordNet lexicon (and the same probably holds for the state of the
art in general) fails to account for the contextual aspects tied to word usage.

Fellbaum et al. (2001) similarly argue that what they call the “dictionary model of word
representation” does not allow a really effective representation of linguistic behaviour as
far as meaning is concerned.

However, it is precisely this model that has heavily influenced the practical realization of
our lexical resources: “WordNet’s entries resemble those of a traditional dictionary,
though its organization is not alphabetical but that of a semantic network” (Fellbaum et al. 2001, p. 4).

This raises the more general issue of the relationship between i) a lexical resource where senses (as
well as other lexical information at other levels of linguistic analysis) are by necessity somehow
“decontextualized” (to be able to capture generalizations) and ii) a corpus sense annotation task,
where, on the contrary, contextualization plays a predominant role and raises a range of pragmatic
issues.

We think that an important challenge in our field is to turn theories able to deal
with the extreme flexibility of meaning into real, large-scale and exploitable resources.

At the moment, it does not seem useful to ask ourselves how many senses a lexical resource
should have. It seems more useful to try to orient a resource towards different
kinds of LE applications, in order to meet the different requirements that different tasks
may have. Machine translation probably needs a more fine-grained representation
of meanings in order to deal with the many idiosyncrasies arising in generation,
while coarse-grained sense distinctions are perhaps enough for information retrieval applications.

In the near future, we would like to study an effective way to make ItalWordNet a more flexible
resource, able to provide different sense clusterings for different uses.

References 

Antonietta Alonge, Francesca Bertagna, Nicoletta Calzolari, Adriana Roventini,
Antonio Zampolli. (2000) Encoding information on adjectives in a lexical-semantic
net for computational applications, in Proceedings of the 1st NAACL Meeting, Seattle, pp.
42-49.

Francesca Bertagna, Claudia Soria, Nicoletta Calzolari. (forthcoming) The Italian Lexical Sample
Task, Edmonds and Kilgarriff (eds.), Proceedings of Senseval-2, in press.

Nicoletta Calzolari, Ornella Corazzari. (2000) Senseval/Romanseval: The Framework for Italian,
A. Kilgarriff and M. Palmer (eds.) Special Issue on Senseval, Computers and the Humanities
34 Nos. 1-2, pp. 61-84, Kluwer, Netherlands.

Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, Susanne Wolf.
(2001) Manual and Automatic Semantic Annotation with WordNet, WordNet and Other
Lexical Resources: Applications, Extensions and Customizations, NAACL 2001 Workshop,
Pittsburgh, USA.



Charles J. Fillmore. (1975) An alternative to checklist theories of meaning, Papers from the 1st Annual
Meeting of the Berkeley Linguistics Society, pp. 123-132.

Patrick Hanks. (2000) Do word meanings exist?, A. Kilgarriff and M. Palmer (eds.) Special Issue on
Senseval, Computers and the Humanities 34 Nos. 1-2, pp. 205-215, Kluwer, Netherlands.

Adam Kilgarriff, Martha Palmer (eds.). (2000) Special Issue on Senseval, Computers and the
Humanities 34 Nos. 1-2, pp. 61-84, Kluwer, Netherlands.

Adam Kilgarriff. (1997) I don't believe in word senses, ITRI-97-12.

Alessandro Lenci, Nuria Bel, Federica Busa, Nicoletta Calzolari, Elisabetta Gola, Monica
Monachini, Antoine Ogonowsky, Ivonne Peters, Wim Peters, Nilda Ruimy, Marta Villegas,
Antonio Zampolli. (2000) SIMPLE: A General Framework for the Development of Multilingual
Lexicons, International Journal of Lexicography, XIII (4): pp. 249-263.

Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella
Corazzari, Antonio Zampolli, Francesca Fanciulli, M. Massetani, Remo Raffaelli, Roberto
Basili, Maria Teresa Pazienza, Dario Saracino, Fabio M. Zanzotto, Nadia Mana, Fabio
Pianesi, Rodolfo Delmonte. (2000) The Italian Syntactic-Semantic Treebank: Architecture,
Annotation, Tools and Evaluation, LINC-2000, Luxembourg.

Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari,
Ornella Corazzari, Antonio Zampolli, Francesca Fanciulli, M. Massetani, Remo Raffaelli,
Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio M. Zanzotto, Nadia Mana,
Fabio Pianesi, Rodolfo Delmonte. (2000) Building the Italian Syntactic-Semantic
Treebank, in “Building and Using Syntactically Annotated Corpora”, Anne Abeillé (editor),
Language and Speech Series, Kluwer, Dordrecht.

Adriana Roventini, Antonietta Alonge, Nicoletta Calzolari, Bernardo Magnini, Francesca 
Bertagna. (2000) ItalWordNet: a large semantic database for Italian , in “Proceedings of 
the 2nd International Conference on Language Resources and Evaluation”, Athens. 

Piek Vossen. (ed.) (1999) EuroWordNet General Document , http://www.hum.uva.nl/~ewn . 



Chart 1: Improved Results for the Clustered Senses 


Claudia Soria, Istituto di Linguistica Computazionale-CNR, Via Moruzzi 1, 56100 Pisa, Italy, soria@ilc.pi.cnr.it 
Francesca Bertagna, Consorzio Pisa Ricerche, Via S. Maria 40, 56100 Pisa, Italy, f.bertagna@ilc.pi.cnr.it
Nicoletta Calzolari, Istituto di Linguistica Computazionale-CNR, Via Moruzzi 1, 56100 Pisa, Italy, 
glottolo@ilc.pi.cnr.it 


Distinguishing Concepts and Instances in WordNet 

Enrique Alfonseca 
Suresh Manandhar 

Abstract 

Many lexical databases make a distinction between concepts (synsets that represent a class of 
things of interest) and instances (examples of concepts). However, that information is not present 
in WordNet. We use empirical evidence that concepts and instances are treated in different ways in 
language, to show that the distinction is not merely theoretical, but that it also affects how a word is 
used. We also describe several NLP applications that could benefit from it, and propose a criterion 
to annotate WordNet with that information. 


1 Introduction 

Many taxonomies handle two different kinds of objects. A concept represents a set of things of interest
that have something in common, while an instance is a single example of a concept. For illustration,
human is a concept that can denote each of the instances of the Homo sapiens species, while
Shakespeare is an instance of that concept and denotes a single individual. This distinction will be made
clearer in the following sections. Usually, but not always, common nouns represent concepts and proper
nouns represent instances.

As far as we know, there has been little attention to the fact that WordNet contains both concepts 
and instances in the semantic network of nouns, with no distinction between them. However, it would 
be useful for many applications to know which synsets are concepts and which ones are instances, 
because they have different properties. Other taxonomical resources, such as Cyc [Lenat and Guha, 
1990] and Ontolingua [Farquhar et al., 1997], have implemented this distinction. We have also collected 
experimental evidence that concepts and instances are used in different ways in language, so there are 
good reasons to consider them apart. 

We describe here work to enrich WordNet with information about which synsets represent instances.
In section 2 we describe the original idea and its applications; section 3 deals with the manual annotation
of WordNet, and section 4 with the experiments performed. Finally, section 5 lists the conclusions of our
work.

1.1 Applications of our work 

An important application of this work is to be able to merge WordNet with other existing ontologies,
such as Cyc or ontologies developed using Knowledge Representation Systems like Ontolingua, with no
loss of information. Importing and merging ontologies between systems is a task that has received much
attention in the last few years, and which is important for projects such as the Semantic Net to be
successful [Maedche and Staab, 2001].

Another application in which we have actually used this work [Alfonseca and Manandhar, 2002] is 
that of extending ontologies, such as WordNet, with new concepts. A system that identifies domain- 
dependent synsets and extends WordNet with them (a task called Ontology Refinement), needs to find 
the WordNet synset that is the hypernym of each of the learnt concepts. If instances were marked as 
such, then they would automatically be non-candidates to be hypernyms, because only concepts can 
be hypernyms. For example, if a program is analysing texts about sailing and it finds the word pram 
referring to a sailboat, it may suggest the WordNet synset sailboat as a possible hypernym, but it will 
never suggest a synset such as Mayflower , because it refers to an instance of a boat, and cannot have 
hyponyms. 
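
The filtering step described in this paragraph can be sketched as follows. The candidate list and the concept/instance labels are toy values chosen to mirror the pram example, not data extracted from WordNet, and the function name is hypothetical.

```python
def hypernym_candidates(candidates, kind):
    """Keep only synsets labelled as concepts; instances cannot have hyponyms."""
    return [synset for synset in candidates if kind.get(synset) == "concept"]


# Toy labels mirroring the example in the text.
kind = {"sailboat": "concept", "boat": "concept", "Mayflower": "instance"}

# Synsets a learner might propose as hypernyms for the new term "pram".
proposed = ["Mayflower", "sailboat", "boat"]
filtered = hypernym_candidates(proposed, kind)
```

With these labels, Mayflower is discarded before any similarity measure needs to be computed for it.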

For Question Answering systems, it is useful to know whether the answer is a concept or an instance.
For example, the answer to a question such as (1a) is an instance of president, and the answer to (1b) is a
concept.

(1) a. Which U.S. president was the first to live in the White House?

    b. What drug can be used to treat malaria?

Finally, this work can also be applied to Text and Explanation Generation, because concepts usually 
need to be used with a determiner, while instances need not. 

2 A taxonomy with instances and concepts 

We can consider the WordNet semantic network, in its current implementation (version 1.7), as a tuple
$W = (\mathcal{L}, \mathcal{S}, f_{\mathcal{L}}, h_{\mathcal{S}}, \mathcal{R})$ where

• $\mathcal{L}$ is the set of lexical entries (words).

• $\mathcal{S}$ is the set of synsets.

• $f_{\mathcal{L}} : \mathcal{L} \to \mathcal{S}^*$ is a function that links the lexical entries with the synsets that contain them.

• $h_{\mathcal{S}} : \mathcal{S} \to \mathcal{S}^*$, called hypernymy, arranges the concepts and instances in a hierarchy.

• $\mathcal{R}$ is the set of non-taxonomic relations.
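
As a rough illustration of this tuple, the following Python sketch stores the five components in a small container. The class name, field names and the toy synsets are all hypothetical; this is not the WordNet database format or any existing API.

```python
from dataclasses import dataclass, field


@dataclass
class WordNetGraph:
    """Toy container mirroring the five components of the tuple W."""
    lexical_entries: set = field(default_factory=set)  # set of words
    synsets: set = field(default_factory=set)          # set of synsets
    senses: dict = field(default_factory=dict)         # word -> list of synsets containing it
    hypernyms: dict = field(default_factory=dict)      # synset -> list of hypernym synsets
    relations: set = field(default_factory=set)        # (relation, source, target) triples

    def add_synset(self, synset, words, hypernyms=()):
        """Register a synset, the words it contains and its hypernyms."""
        self.synsets.add(synset)
        self.hypernyms[synset] = list(hypernyms)
        for word in words:
            self.lexical_entries.add(word)
            self.senses.setdefault(word, []).append(synset)


wn = WordNetGraph()
wn.add_synset("boat", ["boat"])
wn.add_synset("sailboat", ["sailboat", "sailing boat"], hypernyms=["boat"])
```

Note that nothing in this structure says whether a synset is a concept or an instance, which is the gap discussed in the rest of the section.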

We argue that, inside S, there are two different kinds of entities: concepts and instances, as defined
by Degen et al. [2001] (who call instances individuals, and concepts universals):

Individuals belong to the realm of concrete entities, which means that they exist within the 
confines of space and time. Universals, in contrast, are entities that can be instantiated 
simultaneously by a multiplicity of different individuals that are similar in given respects. 

We can think of universals as patterns of features which are realized by their instances. 

There is disagreement in the related literature about the interpretation of instances and concepts.
One of the most widely accepted views considers concepts as the sets of their instances [Montague, 1974]. Using
an example from [Welty and Ferucci, 1999], bird would be the set containing all possible birds, real and
imagined, past and future; eagle would be a subset of bird, and the eagle Harry would be an instance (a
member) of both sets.

In First Order Predicate Logic (FOPL), instances are usually represented using constants, and
concepts by using unary predicates. For example, the following sentence

(2) John saw a man

is usually translated into FOPL as

$$\exists x.\; saw(John, x) \land man(x),$$

and the concept man represents the set $\{x : man(x)\}$. Nevertheless, there are some frameworks in which,
for simplicity, instances are also represented with predicates.

On the other hand, there are theories that consider instances to be sets of universals, or sets of
individualised properties, also called tropes. However, as Degen et al. [2001] point out, this interpretation
poses some difficulties in dealing with different temporal profiles of the entities.

2.1 An instance by itself, or an instance of something else? 

According to the definition stated above, when a synset represents features that can be realized by 
distinguishable instances, it is considered a concept, while when it refers to a concrete entity, or different 
manifestations of it, then it is an instance. But this definition still has to be further clarified. 

As Welty and Ferucci [1999] note, something can be both a concept and an instance. They propose
an example in which we have four synsets: species, bird, Imperial eagle (Aquila eliaca), and Harry, which
is an imperial eagle. Here Aquila eliaca is an instance of species, but at the same time it is a concept,
one of whose instances is Harry (see Figure 1a).

In fact, everything can behave as an instance or a concept, depending on the interpretation. For
example, the McMillans can be an instance of family or clan, but it is also a concept that includes
all its members, as in sentence (3).

(3) That man is a McMillan 

Similarly, university software might need to create a new entry every year for each student. In
that context, John Smith-1995 and John Smith-1996 can be considered instances of John Smith, which
is an instance of student (see Figure 1b). Summarising,

• On the one hand, every concept is an instance of Concept.

• On the other hand, every instance may be considered a concept whose instances are different
manifestations of that same concept (e.g. at different times or observed by different people).

Figure 1: In (a), Aquila eliaca is at the same time a concept and an instance [Welty and Ferucci, 1999].
In (b), although the student John Smith would normally be used as an instance, on some occasions it
may be useful to consider it as a concept.

The previous reasoning indicates that, rather than classifying synsets as instances or concepts, we 
should label the hypernymy links as instance links or sub-concept links instead. However, by doing this 
we would lose the intended objective of making WordNet more similar to other existing lexical databases 
or Knowledge Representation Systems, such as Cyc or Ontolingua, in which entities have to be either 
instances or concepts. Therefore, we have taken the following middle point: 

• We classify synsets either as concepts or instances. 

• If a synset has hyponyms, it is being considered as a concept in the taxonomy, and hence we mark
it as such.

• Leaf synsets (synsets with no hyponyms) will be annotated either as instances or concepts, according
to the relation they hold with respect to their immediate hypernyms in the taxonomy.

• If a leaf synset has several immediate hypernyms, and it is a subconcept of at least one of them, 
then we shall classify it as a concept, because there is at least one interpretation in which it will 
have its own instances. 
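
The criterion in the list above can be written down as a small decision function. This is only a sketch: the per-link labels ("instance" or "subconcept") are assumed to come from a human annotator, and the function and variable names are hypothetical.

```python
def classify(synset, hyponyms, leaf_links):
    """Label a synset as "concept" or "instance".

    hyponyms:   dict mapping a synset to the list of its hyponyms
    leaf_links: dict mapping a leaf synset to one label per immediate
                hypernym, each either "instance" or "subconcept"
    """
    # Any synset with hyponyms is treated as a concept.
    if hyponyms.get(synset):
        return "concept"
    # A leaf is a concept if it is a subconcept of at least one hypernym.
    labels = leaf_links[synset]
    return "concept" if "subconcept" in labels else "instance"


# Toy data for illustration only.
hyponyms = {"boat": ["sailboat", "Mayflower"]}
leaf_links = {
    "sailboat": ["subconcept"],
    "Mayflower": ["instance"],
    "two_parent_leaf": ["instance", "subconcept"],
}
labels = {s: classify(s, hyponyms, leaf_links)
          for s in ["boat", "sailboat", "Mayflower", "two_parent_leaf"]}
```

A leaf with no annotation raises a KeyError here, which keeps unlabelled synsets from being classified silently.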

As we have discussed, this is probably not the optimal solution according to the theory, but we
think it is a good criterion for making WordNet more similar to other existing taxonomical
resources.

2.2 Changes to WordNet 

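If $\mathcal{S}$ is the set of synsets in WordNet, and $h_{\mathcal{S}}$ is the hypernymy relationship, we propose to divide $\mathcal{S}$ into
two subsets $\mathcal{C}$ and $\mathcal{I}$, and to modify $h_{\mathcal{S}}$ in the following way:

• $\mathcal{C}$ will be a set of concepts.

• $\mathcal{I}$ will be a set of instances.

• $\mathcal{S} = \mathcal{C} \cup \mathcal{I}$

• $h_{\mathcal{S}}$ is modified to $h_{\mathcal{S}} : \mathcal{S} \to \mathcal{C}^*$.

Hence we modify the definition of $W$ to include the instances: $W = (\mathcal{L}, \mathcal{S}, \mathcal{I}, f_{\mathcal{L}}, h_{\mathcal{S}}, \mathcal{R})$. If we define
a leaf as any synset that has no hyponyms,

$$\mathrm{Leaves}(W) = \{ s \in \mathcal{S},\ \nexists n : h_{\mathcal{S}}(n) = s \}$$

then, in our framework, instances can only be leaves in the WordNet taxonomy. However, some leaves
can represent concepts, if they have not been instantiated.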
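
As a minimal sketch of these definitions, the following Python code computes the leaves of a toy taxonomy and checks the constraints of the modified model: concepts and instances are disjoint, hypernyms are always concepts, and instances are always leaves. Hypernymy is read here as a mapping from each synset to a list of hypernyms, which is an assumption about the notation; the synset names are illustrative only.

```python
def leaves(synsets, hypernyms):
    """Synsets that never appear as the hypernym of another synset."""
    with_hyponyms = {h for parents in hypernyms.values() for h in parents}
    return {s for s in synsets if s not in with_hyponyms}


def is_consistent(concepts, instances, hypernyms):
    """Check that concepts and instances are disjoint, that every hypernym
    is a concept, and that every instance is a leaf."""
    synsets = concepts | instances
    disjoint = not (concepts & instances)
    hypernyms_are_concepts = all(
        h in concepts for parents in hypernyms.values() for h in parents
    )
    instances_are_leaves = instances <= leaves(synsets, hypernyms)
    return disjoint and hypernyms_are_concepts and instances_are_leaves


# Toy taxonomy (illustrative names only).
concepts = {"vessel", "boat", "sailboat"}
instances = {"Mayflower"}
hypernyms = {
    "vessel": [],
    "boat": ["vessel"],
    "sailboat": ["boat"],
    "Mayflower": ["boat"],
}
leaf_set = leaves(concepts | instances, hypernyms)
```

In this toy taxonomy sailboat is a leaf that is still a concept, which is the situation described in the last sentence above.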


synset id    synset word             concept-leaves   instance-leaves
n00005145    person                           4,534             2,913
n00018241    location                           735             1,773
n00016210    psychological-feature                                618
n00015211    artifact                         7,494
n00021056    act, human action                4,213               210
other                                        32,019             1,399
Total                                        51,553             7,033

Table 1: Results of the manual annotation of instances and concepts in WordNet. 


3 Manual annotation of WordNet 

In English, the instances of some concepts are rarely named. These concepts include psychological
features, acts, etc. For example, the fear I felt yesterday at 12 noon is an instance of the concept fear.
Although it is possible to give it a name such as My Midday Fear or any other identifier, these kinds
of entities do not usually receive a proper name in the English language. In English, concepts whose
instances are usually named are animate beings (e.g. people, animals, even plants), locations (e.g. cities,
etc.), ideas and intellectual works (e.g. theorems, books, etc.) and some objects (e.g. ships, such as
Mayflower).

However, after examining WordNet in detail, we came to the conclusion that the language can
contain, in theory, instances of practically every concept. A few examples of instances of unlikely entities
are:

• Creation, meaning God's act of bringing the universe into existence, which is a hyponym of action.

• Gettysburg Address, which is a speech delivered by Abraham Lincoln during the war, and is a
hyponym of speech-act.

Therefore, we have classified by hand all leaf synsets in the nouns taxonomy, according to the relation 
they hold with their hypernyms. 

3.1 Manual annotation results 

WordNet version 1.7 contains 58,586 leaf synsets, i.e. synsets with no hyponyms. All of them were
manually annotated according to the criteria described above, and the results are displayed in Table 1.
If our annotations are correct, there are 51,553 concepts and 7,033 instances among the leaves. Some
of the branches with a high number of instances are person, location and psychological feature, this last
branch because it includes all the mythological characters.
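
The totals quoted above can be checked with a few lines of arithmetic. The snippet only restates the three figures given in the text and derives the share of instances among the leaves; the variable names are arbitrary.

```python
concept_leaves = 51_553
instance_leaves = 7_033
total_leaves = 58_586

# The two classes should account for every leaf synset.
totals_match = concept_leaves + instance_leaves == total_leaves

# Fraction of leaf synsets annotated as instances (about 0.12).
instance_share = instance_leaves / total_leaves
```

By these figures, roughly one leaf synset in eight was annotated as an instance.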

3.2 Annotating difficult cases 

Language is always changing, and something that is considered an instance at a certain moment can, 
with time, come to be a concept. For example, the first Unix could be considered, at the moment it was 
released, an instance of the concept operating system. However, it can now be considered as any of the 
operating systems that have a similar architecture and a common set of commands, and that includes, 
among others, Solaris, BSD Unix, AIX, IRIX and Linux. As said above, when we saw that there exists 
a plausible interpretation in which a synset could be considered a concept, we have classified it as such. 

Other decisions were also difficult to make because the meaning of the synset could be viewed from different
perspectives. For example, literary works such as Genesis, Exodus or Aesop’s Fables can be interpreted,
depending on the context, in different ways, as the following sentences show. In (4a), Genesis refers to
the text contents, the intellectual work; while in (4b) the book title refers to the physical book, and in
(4c) it refers to a set of pages in a book. The theory that the same word can represent different views
of the same thing was developed mainly in [Pustejovsky, 1995].

(4) a. Genesis was translated into Greek.

b. Aesop’s Fables looks nice on the shelf. 

c. The boy tore off Genesis from his Bible. 



avatar
    => Jagannath
    => Kalki
    => Krishna
    => Rama
        => Ramachandra
        => Balarama
        => Parashurama

Figure 2: Rama should be an instance of avatar, but it is also a concept which has three different 
instances: the three incarnations. 

However, in WordNet these concepts are located under abstraction, not under object. Therefore, only 
the meaning in (4a) was considered to make the decision, and they were considered as instances, because 
the contents of the book are unique. 

If, while processing a text, we found sentence (4b), then we should be able to infer that we are talking 
about an instance of physical book, whose contents are the abstraction represented by the synset Genesis. 

Some concepts can have different names depending on their manifestations. For example, the planet 
Venus can be called morning star if it is visible in the early morning, or evening star if visible at sunset. 
If we were doing a full annotation of all hypernymy relationships, we could say that Venus is an instance 
of planet, and both morning star and evening star are instances of Venus. However, as we finally decided 
to annotate synsets and not relations, we are not able to capture this distinction. Although we only 
found a handful of problematic cases like this in the whole semantic network (see Figure 2), we think it 
is something that has to be addressed in the future. 


4 Automatic annotation of WordNet 

After the manual annotation, we repeated the same procedure using an automatic algorithm 
for deciding which synsets represent instances. The motivation for this was twofold. In the first place, 
the automatic procedure can be used to predict whether a new domain-specific concept, not present in 
WordNet, is an instance or a concept without the need of human annotators. Secondly, by using very 
simple features, we want to show that instances and concepts are indeed used in different 
ways in language and can be easily detected; that is, there is empirical evidence that the difference 
really exists in language. 

The learning model we chose was a Maximum Entropy model [Berger et al., 1996] [Ratnaparkhi, 
1998]. In this framework, the problem consists in learning a probability model 

$$P_{ME}(s) = \frac{1}{Z} \cdot P_0(s) \cdot \exp\Big(\sum_i \lambda_i f_i(s)\Big)$$ 

where $P_{ME}(s)$ is the probability that s is an instance, $P_0(s)$ is an initial probability distribution, $Z$ 
is a normalising constant, and the $f_i$ are binary features about the examples. Using an iterative algorithm, 
it is possible to obtain values for the parameters $\lambda_i$ so that the model classifies the training data as well 
as possible. We have used the Java package quipu.maxent, which is freely distributed [Baldridge et al., 
2001]. 
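As an illustration of how the model above turns feature values into a probability, the following is a 
minimal sketch in Python. It is not the quipu.maxent implementation: the two-class normalisation, the 
per-class weight vectors and the function name maxent_probability are assumptions made for this example. 

```python
import math


def maxent_probability(features, weights, prior=0.5):
    """Return P_ME(instance) for one synset under a two-class model.

    features: list of booleans, the binary features f_i(s)
    weights:  dict mapping each class name to its list of lambda_i
              (hypothetical layout, chosen for this sketch)
    prior:    P_0 for the class 'instance'
    """
    priors = {"instance": prior, "concept": 1.0 - prior}
    scores = {}
    for label, lambdas in weights.items():
        # Sum lambda_i over the features that are true for this synset.
        total = sum(lam for lam, f in zip(lambdas, features) if f)
        scores[label] = priors[label] * math.exp(total)
    z = sum(scores.values())  # normalising constant Z
    return scores["instance"] / z
```

With all weights at zero the function returns the prior; a positive weight on an active feature for the 
class instance moves the probability above it. 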


4.1 Features chosen 

Instances, in language, have some properties in common with mass nouns. For example, they are rarely 
preceded by the articles the and a or used in the plural. On the other hand, mass nouns can be quantified 
by weight, volume, etc., while instances cannot. We made use of these facts in order to choose the 
right features to distinguish them. 

At this point, it is necessary to make a distinction. Some instance names, such as Judas denoting 
the man that lived in Judea (sense 08886770), have taken on other meanings, such as someone who 
betrays (sense 08201644). These are different synsets: the first one represents an instance and, as such, 
cannot have an article in the specifier position; the second one represents a concept and can be 
quantified or preceded by a determiner. 

(5) a. Judas hung himself. 

b. Don’t trust him, I think he’s a Judas. 

c. There are several ‘Judases’ in that political party. 

We collected the material for each synset from the Internet. For every synset, we performed an 
automatic search on an Internet search engine, using the words in the synset and the gloss and, sometimes, 
hypernyms and hyponyms, and retrieved several web pages that were examined with a program. In this 
way, we had a different corpus for each WordNet synset, from which we could extract statistics about 
whether the words in a synset can be used with determiners or quantifiers, or whether they can be used 
in the plural. 

The following are several examples of features: 

f_1(s) = true if any word in synset s was found preceded by the determiner the in the documents; false 
otherwise 

f_2(s) = true if no word in synset s was found with any determiner in the sample documents; false 
otherwise 

We also used capitalisation as a feature. Although not every capitalised word is an instance, many 
of them are, so this feature provides support in favour of considering the word an instance. 

f_3(s) = true if every word in synset s is capitalised; false otherwise 
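The three example features can be sketched as follows. This is only an illustration under stated 
assumptions: the determiner list, the tokeniser, the restriction to single-word synset members and the 
reading of the capitalisation feature as "every word is capitalised" are choices made for this example, 
not details given in the text. 

```python
import re

# Hypothetical determiner list; the original set is not specified.
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those", "some", "every"}


def extract_features(synset_words, documents):
    """Return (f1, f2, f3) for one synset, given its sample documents.

    Only single-word synset members are matched in this sketch.
    """
    words_lower = {w.lower() for w in synset_words}
    seen_the = False
    seen_any_determiner = False
    for doc in documents:
        tokens = re.findall(r"[A-Za-z0-9'-]+", doc)
        # Look at each token together with the token that precedes it.
        for prev, cur in zip(tokens, tokens[1:]):
            if cur.lower() in words_lower:
                if prev.lower() == "the":
                    seen_the = True
                if prev.lower() in DETERMINERS:
                    seen_any_determiner = True
    f1 = seen_the                                        # preceded by "the"
    f2 = not seen_any_determiner                         # never with a determiner
    f3 = all(w[:1].isupper() for w in synset_words)      # capitalised in the synset
    return f1, f2, f3
```

For instance, a synset containing only Judas with the single document "Judas hung himself." yields 
(False, True, True), the pattern expected of an instance. 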

4.2 Collection of features from Internet 

The Internet has been used as a corpus to collect features about the way words are used. The procedure 
we chose is similar to the one described in [Agirre et al., 2000]. For each WordNet synset, a query is 
automatically generated containing the words in that synset and its hyponyms and hypernyms as positive 
examples, and the words from other synsets that contain different word-senses as negative examples. For 
example, the Altavista query for the word country, with the sense 06621523: 

state, nation, country, land, commonwealth, res publica, body politic - (a politically 
organized body of people under a single government; "the state has elected a new president"; 
"African nations"; "students who had come to the nation's capitol"; "the country's largest 
manufacturer"; "an industrialized land") 

is the following: 

"country" AND ("body politic" OR "commonwealth" OR "land" OR "nation" OR "res 
publica" OR "state" OR "Reich" OR "suzerain" OR "sea power" OR "great power" OR 
"major power" OR "power" OR "superpower" OR "world power" OR "city state" OR "ally") 
AND NOT ("a people" OR "area" OR "rural area") 

Next, the program performs a query to the Altavista search engine¹, downloads the documents, 
preprocesses them and extracts the features. 
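The query format shown above can be assembled mechanically from the positive and negative word lists. 
The following is a small sketch of that step; the function name build_query and its parameters are 
hypothetical, and the choice of which word acts as the head term is an assumption. 

```python
def build_query(head, positives, negatives):
    """Build a boolean query: head AND (positives) AND NOT (negatives)."""

    def quote(word):
        return '"%s"' % word

    query = quote(head)
    if positives:
        query += " AND (" + " OR ".join(quote(w) for w in positives) + ")"
    if negatives:
        query += " AND NOT (" + " OR ".join(quote(w) for w in negatives) + ")"
    return query
```

Calling build_query("country", ["nation", "state"], ["area"]) produces a shortened version of the query 
for sense 06621523 given above. 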

4.3 Results 

The training set was built with 300 concepts and 150 instances selected randomly among the WordNet 
leaves. We used a five-fold evaluation: the training set was divided into five subsets of the same size, each 
one with 60 concepts and 30 instances; in each experiment, four of them were chosen for training and 
the fifth for testing. The program automatically downloaded the documents and extracted the features. 
To measure the accuracy of this procedure, we used the manual annotation of WordNet that we had 
performed earlier (see the previous section). 
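The five-fold split described above, with each subset holding 60 concepts and 30 instances, can be 
sketched as below. The shuffling seed and the function name five_fold_splits are assumptions for the 
example; the text does not say how the subsets were drawn beyond their sizes. 

```python
import random


def five_fold_splits(concepts, instances, k=5, seed=0):
    """Yield (train, test) pairs, keeping the concept/instance ratio per fold."""
    rng = random.Random(seed)
    concepts = list(concepts)
    instances = list(instances)
    rng.shuffle(concepts)
    rng.shuffle(instances)
    # Every k-th item goes to the same fold, so each fold gets an equal share.
    folds = [concepts[i::k] + instances[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With 300 concepts and 150 instances, every test fold has 90 items and every training set 360. 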

We calculated a baseline by considering that every synset is a concept. This gives us an accuracy of 
88% (51,553 synsets correctly classified). The resulting accuracy for each of the five experiments is 
shown in Table 2. The final accuracy 
is 96.62%, with a mean of three mistakes in each test set. We believe this indicates that instances are 
indeed used in a different way in language, and that they can be recognised using just a very small set of 
features. 
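As a quick check, the mean of three mistakes per test set follows directly from the fold size (90 
synsets) and the mean accuracy: 

```latex
\text{mistakes per test set} \approx 90 \times (1 - 0.9662) = 90 \times 0.0338 \approx 3.04
```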

1 http://altavista.digital.com 





experiment    accuracy 
1st 
2nd           98.9% 
3rd 
4th           98.9% 
5th           95.4% 
Mean          96.62% 
Baseline      88% 


Table 2: Results of the five-fold evaluation 


(a) Synset            Type         (b) Synset       Type 
    hobbit, Hobbit    concept          Danaan       concept 
    orc, Orc          concept          Ajax         concept 
    Ent               concept          Atreus       instance 
    gollum, Gollum    instance         Idomeneus    instance 
    Frodo             instance         Tydeus       instance 
    Bag-End           instance         Diomed       instance 


Table 3: Synsets extracted from The Lord of the Rings (a) and The Iliad (b), and classification received 
by the maximum entropy algorithm 


4.4 Analysis of the errors 

By examining the erroneous classifications by hand, we noted that the major source of errors was synsets 
that describe languages, such as English, French, Chinese, etc. They had been manually annotated as 
concepts, because a language can be considered as the concept that groups all its dialects. We also based 
this decision on the fact that the synset for the English language (synset number 05689601) has as 
many as eight hyponyms, which refer to the more common English dialects: American English, cockney, 
Middle English, etc. 

However, languages are usually used without determiners, never in plural, and written capitalised, so 
the automatic algorithm misclassified them. The addition of the new feature 

f_4(s) = true if s is defined as a language or a dialect in the synset gloss; false otherwise 

increased the accuracy to 97.2%. 
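The gloss-based feature for languages can be approximated with a simple keyword test. This is a sketch 
under the assumption that "defined as a language or a dialect" means the gloss mentions one of those 
two words; the text does not give the exact rule, and the glosses used below are illustrative. 

```python
import re


def f4_language_gloss(gloss):
    """True if the gloss mentions 'language' or 'dialect' (singular or plural)."""
    return re.search(r"\b(language|dialect)s?\b", gloss.lower()) is not None
```

A gloss such as "an Indo-European language of the West Germanic branch" activates the feature, while the 
gloss of country quoted in Section 4.2 does not. 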

The remaining classification errors were due to a variety of reasons: 1530s and 1770s were considered 
as plural words, and were misclassified as concepts; Allen-wrench never appeared in our sample 
documents in plural, so it was mistagged as an instance. In fact, the other ten misclassifications 
were due to a lack of data: words that can be used with determiners or in plural form 
never appeared like that in the small corpora downloaded from the Internet. 

5 Classification of new concepts 

We carried out another experiment by looking for new synsets in two different texts, Tolkien’s The 
Lord of the Rings and Homer’s The Iliad; some of them are displayed in Table 3. The features used are: 
the determiners with which they were seen in the texts; whether they have been used in plural or not; 
and whether they were capitalised always, sometimes or never. 
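Two of these features, the three-valued capitalisation pattern and the plural test, can be sketched as 
follows. The function names and the naive plural rule (appending "s" or "es") are assumptions for this 
example, and the sketch does not correct for sentence-initial capitalisation. 

```python
def capitalisation_pattern(word, tokens):
    """Return 'always', 'sometimes' or 'never'; None if the word is not found."""
    occurrences = [t for t in tokens if t.lower() == word.lower()]
    if not occurrences:
        return None
    capitalised = sum(1 for t in occurrences if t[0].isupper())
    if capitalised == len(occurrences):
        return "always"
    if capitalised == 0:
        return "never"
    return "sometimes"


def seen_in_plural(word, tokens):
    """True if a naively formed plural of the word occurs in the tokens."""
    base = word.lower()
    plurals = {base + "s", base + "es"}
    return any(t.lower() in plurals for t in tokens)
```

A word such as Gollum, written both capitalised and in lowercase, gets the value "sometimes", and the 
phrase "the two Ajaxes" makes the plural test succeed for Ajax. 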

In the first case, we extracted the 42 unknown words that appeared 50 or more times in the whole 
document, and all of them were correctly classified. The word Gollum, which is a character of the book, 
appeared written in lowercase because that character used his name as an interjection when he spoke. 

In the second case, 28 unknown words were extracted. The most interesting case is that of the word 
Ajax, which was classified as a concept. We found that there are two characters in the text 
called Ajax, and the author quite often refers to them together with the phrase “the two Ajaxes”, so it 
could be considered as a concept that includes both people (Figure 3), similar to the case of Rama in 
Figure 2. 









Ajax 
  => Ajax son of Telamon 
  => Ajax son of Oileus 


Figure 3: Interpretation of the word Ajax found in The Iliad. When it refers to any person called Ajax, 
it is a concept; when it refers to a particular person, it is an instance. 


6 Conclusions and Future Work 

We have described work aimed at identifying concepts and instances in WordNet. We believe that this 
kind of information, present in other lexical knowledge bases such as Cyc [Lenat and Guha, 1990], is 
important and has many possible uses. 

As discussed in section 2, we believe that this work would be more complete if, instead of annotating 
synsets, we had annotated the hypernymy relationships between them. This remains open work 
that can be attempted in the future. 

Our experimental results show that with a very small set of features (capitalisation and 
determiners), and a rather small training set, a high accuracy can be obtained in distinguishing instances 
from concepts. That is a strong indication that the distinction between concepts and instances indeed exists. 

We believe this work will have applications in other areas, such as question answering and ontology 
acquisition. Our future work will also consist in applying its results to these two fields. 


7 Acknowledgements 

This work has been partially sponsored by CICYT, project number TIC2001-0685-C02-01. 

8 Authors affiliation 

Enrique Alfonseca is an assistant lecturer at the Computer Science Department, Universidad Autónoma 
de Madrid, and a part-time research student at the University of York. Suresh Manandhar is a lecturer 
at the Computer Science Department, University of York. 

Contact: {enrique, suresh}@cs.york.ac.uk 


References 

E. Agirre, O. Ansa, E. Hovy, and D. Martinez. Enriching very large ontologies using the WWW. In 
Proceedings of the Ontology Learning Workshop, ECAI, Berlin, Germany, 2000. 

E. Alfonseca and S. Manandhar. An unsupervised method for general named entity recognition and 
automated concept discovery. In Proceedings of the First International Conference on Global WordNet, 
Mysore, India, 2002. 

Jason Baldridge, Tom Morton, and Gann Bierner. Quipu maxent, 

https://sourceforge.net/projects/maxent/, 2001. 

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach 
to natural language processing. Computational Linguistics, 22(1):39-71, 1996. 

W. Degen, B. Heller, H. Herre, and B. Smith. Gol: Towards an axiomatized upper-level ontology. In 
Proceedings of the International Conference on Formal Ontology in Information Systems, FOIS-2001 , 
2001. 

Farquhar, Fikes, and Rice. Tools for assembling modular ontologies in Ontolingua. AAAI Press, Menlo 
Park, California, 1997. 

D. B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems. Addison-Wesley, Reading (MA), 
USA, 1990. 





A. Maedche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 
2001. 

R. Montague. Formal Philosophy. New Haven: Yale University Press, 1974. 

J. Pustejovsky. The Generative Lexicon. The MIT Press, Cambridge, Massachusetts, 1995. 

Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. 
Dissertation. University of Pennsylvania, 1998. 

A. C. Welty and D. A. Ferucci. Instances and classes in software engineering. Intelligence Magazine , 10 
(2):24-28, 1999. 



Cleaning-up WordNet’s Top-Level 

Aldo Gangemi, Nicola Guarino 
Alessandro Oltramari, Stefano Borgo 


Abstract 

In this paper we propose an analysis and an upgrade of WordNet's top-level synset 
taxonomy of nouns. We briefly review WordNet and identify its main semantic limitations. 
Some principles from a forthcoming OntoClean methodology are applied to the ontological 
analysis of WordNet. A revised top-level taxonomy is proposed, which is meant to be 
conceptually more rigorous, cognitively transparent, and efficiently exploitable in several 
applications. This work is a revision and extension of [3]. 

Keywords — WordNet, ontology, taxonomies, top-level 

1 Introduction 

The main goal of this paper is to present a conceptual analysis of WordNet's top level. We 
shall base our analysis on the OntoClean methodology, a powerful (yet not completed) set of 
theoretical tools for the ontological refinement of taxonomies which is based on previous 
work developed at LADSEB-CNR [7,11] and at ITBM-CNR [4]. Presenting here the full 
details of this methodology would be premature. Rather, we intend to show how OntoClean can 
be applied to a practical example, testing at the same time its limits and capabilities. 


2 The OntoClean Methodology: Basic Distinctions 

The main point of the OntoClean methodology is the characterization of the concepts 
appearing in an ontology in terms of formal meta-properties. Since these meta-properties are 
largely independent from a particular ontological commitment, and they are defined for an 
arbitrary domain, they are a good way to make ontological choices explicit. 

The set of meta-properties used in this paper is informally summarized below. We limit 
ourselves to an intuitive description, pointing to previous work for the technical details as 
described below. A set of concepts (partially) characterized by these meta-properties, 
corresponding to our preliminary top-level choices, will be described in Section 4. 

The term "meta-property" adopted here is based on a fundamental distinction within the 
domain of discourse: on one hand individuals or particulars form the object-level domain, on 
the other hand concepts or universals form the meta-level domain. Meta-properties induce 
distinctions in the meta-level domain, i.e. among concepts, while concepts induce distinctions 
in the object-level domain, i.e. among individuals. 

A clear distinction between the object-level domain and the meta-level domain is necessary to 
conceptual modelers: as we shall see in Section 3.2, mixing concepts and individuals in the 
same taxonomic structure (collapsing IS-A and INSTANCE-OF into a single hyperonymy 
relation) is a first source of confusion in WordNet. 

Note that the object-level domain contains two groups of individuals, namely objects and 
events. We do not dig much into this distinction in this paper. However, here and there we will 
need to point out some notions that apply to one or the other group only. 



2.1 Formal meta-properties 


Before briefly introducing the meta-properties used in this paper, let us give the reader the 
most up-to-date references to the relevant previous work. The most recent account of the notions of 
rigidity and identity is [7]; types and roles have been discussed in [6]; the definitions 
involving unity and plurality have been introduced in [2]; the notions of dependence and 
extensionality are introduced here in a slightly different form with respect to previous papers, and are 
still work in progress. 

2.1.1 Rigidity 

A property is essential to an individual iff it necessarily holds for that individual. 

A property is rigid (+R) iff, necessarily, it is essential to all its instances. A property is 
non-rigid (-R) iff it is not essential to some of its instances, and anti-rigid (~R) iff it is not 
essential to all its instances. 

For example, person is usually considered as rigid, since every person is essentially such, 
while student is usually considered as anti-rigid, since every student can possibly be a 
non-student. 

2.1.2 Identity 

A property carries an identity criterion (+I) iff all its instances can be (re)identified by 
reducing their identity to the identity of some suitable characteristics. A property supplies an 
identity criterion iff such criterion is not inherited by any subsuming property. 

For example, person is usually considered as supplying an identity criterion (although 
determining which one could be hard), while student just inherits the identity criterion of person, 
without supplying any further identity criteria. 

2.1.3 Dependence 

An individual x is constantly dependent on y iff, at any time, x can’t be present¹ unless y is 
fully present (that is, all its parts are present), and y is not part of x. For example, a person’s 
life is constantly dependent on that person, while the converse does not hold, since, at any 
time, there are some proper parts of one’s life that are not present. In general, an event is 
constantly dependent on (at least some of) its participants. 

A property P is constantly dependent (+D) iff, for all its instances, there exists something they 
are constantly dependent on. 

A property P is notionally dependent (+ND) on another property Q iff, whenever P(x) holds, 
then Q(y) must hold for a different y (which is not a proper part of x). For instance, student 
can be seen as notionally dependent on teacher. The event of "my listening to yesterday's 
Depeche Mode concert" can be seen as notionally dependent on the event of "Depeche Mode 
playing during yesterday's concert". 

In the following, we shall take "dependent" as synonymous with "constantly dependent", unless 
otherwise specified. 

2.1.4 Types and roles 

A rigid property that supplies an identity criterion and is not notionally dependent on another 
property is called a type. An anti-rigid property that is notionally dependent is called a role. 
It is a material role if it carries (but does not supply) an identity criterion, and a formal role 
otherwise. In our example, person would be a type, and student a material role. Part is an 
example of a formal role, since it carries no identity and is notionally dependent. 


1 An entity is present at a certain time if its temporal location overlaps that time. 






Type and Role are examples of formal meta-categories defined by means of multiple 
meta-properties. 
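The definitions of type, material role and formal role above amount to a small decision procedure over 
the meta-properties of a property. The following sketch encodes it; the function name meta_category and 
the boolean encoding of the identity and dependence meta-properties are assumptions for this example. 

```python
def meta_category(rigidity, carries_identity, supplies_identity, notionally_dependent):
    """Classify a property as 'type', 'material role' or 'formal role'.

    rigidity is one of '+R', '-R', '~R'. Returns None when the property
    matches neither definition given in the text.
    """
    if rigidity == "+R" and supplies_identity and not notionally_dependent:
        return "type"
    if rigidity == "~R" and notionally_dependent:
        # Material roles carry, but do not supply, an identity criterion.
        if carries_identity and not supplies_identity:
            return "material role"
        return "formal role"
    return None
```

With the examples from the text, person (+R, supplies identity, not dependent) is a type, student (~R, 
inherits identity, dependent on teacher) is a material role, and part (~R, no identity, dependent) is a 
formal role. 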

2.1.5 Extensionality 

The adjective "extensional" is used in different meanings in the literature. Sets are said to be 
extensional since they are identical when they have the same members, and properties are 
considered as extensional when properties with the same instances are taken as identical. In 
our framework, an individual is said to be extensional iff, necessarily, it is present whenever 
its proper parts are present. A property is extensional (+E) iff, necessarily, all its instances are 
extensional. It is anti-extensional (~E) iff, necessarily, all its instances are non-extensional, so 
that they can possibly change some parts while keeping their identity. 

2.1.6 Concreteness 

This meta-property is a bit less general than the previous ones, in the sense that it makes an 
ontological commitment towards the existence of physical (spatial, temporal or 
spatio-temporal) locations. We see physical locations as primitive qualities that individuals can have 
(see section 4.5). Without entering into any detail, we shall stipulate that an individual is 
concrete iff it has a physical location. A property whose instances are necessarily concrete will be 
marked with the meta-property +C. 

2.1.7 Unity, singularity, and plurality 

An individual is unified by a certain (suitably constrained) relation R iff it is a mereological 
sum of entities that are bound together by R. For instance, the relation having the same boss 
may unify a group of employees in a company. An individual w is a whole under R iff it is 
maximally unified by R, in the sense that no part of w is linked by R to something that is not 
part of w. For instance, the mereological sum of all the employees in a company forms a 
whole. An individual is an essential whole iff it is necessarily a whole. 

A property P is said to carry unity (+U) if there is a common unifying relation R such that all 
the instances of P are essential wholes under R. For instance, the property sheet of paper 
carries unity because there is a relation R such that every two pieces of the sheet are connected 
by another piece of sheet. The property library carries unity because there is a relation R 
(e.g., being registered in the same catalog), which may unify a number of different books. A 
property carries anti-unity (~U) if none of its instances is an essential whole, that is, if all 
its instances can possibly be non-wholes. Properties which refer to 
amounts of matter, like gold, water, etc., are good examples of anti-unity (see §4.1). 

If every instance of P is an essential whole, but there is no unifying relation common to all 
instances of P, then we mark P with the property *U. For instance, as we shall see, we 
assume that each ordinary object is an essential whole, although different ordinary objects may 
have a different unifying relation. The property “ordinary object” will therefore be marked 
*U. 

An individual is a singular whole iff its unifying relation has a specific topological nature, so 
that this entity is “of one piece”. Under this view, a piece of coal is singular, while a lump of 
coal is plural. The idea is that entities of one piece, like a piece of coal, have a special 
cognitive relevance, which accounts for the natural language distinction between singular and 
plural. They can be seen as maximally self-connected entities, where connection is taken as 
strong connection, like that existing between 3D regions sharing a surface in common. On the 
other hand, in the case of two pieces of coal merely touching each other, we have a form of 
weak connection, which exists between regions having a point in common. 






A plural individual will be a sum of singular wholes which is not itself a singular whole. 
Plural individuals may be wholes themselves or not. In the former case they will be called 
unitary collections; in the latter case arbitrary collections. 


2.2 Taxonomic constraints imposed by formal properties 

As discussed in [7], most of the formal distinctions introduced above impose important 
constraints on taxonomic relationships. In practice, if a property necessarily holds for all the 
instances of a certain concept, its negation cannot hold necessarily for all the instances of a 
subsumed concept. This means that, if F is a certain formal property, anti-F cannot subsume 
F: anti-rigidity cannot subsume rigidity, anti-unity cannot subsume unity, and 
anti-extensionality cannot subsume extensionality. As a further requirement, we have that two 
properties must be disjoint if they have incompatible unity or identity criteria. As a result of 
these constraints, after having analysed the formal properties holding for every 
concept in a taxonomy, we can easily check the ontological consistency of subsumption links, 
and restructure the taxonomy if necessary. 

3 WordNet’s Preliminary Analysis 

3.1 Experiment Setting 

We applied our methodological principles and techniques to the noun synsets taxonomy of 
WordNet 1.6. To perform our investigation, we had to adopt some preliminary assumptions 
in order to convert WordNet's databases² into a workable knowledge base. At the beginning, 
we assumed that the hyponymy relation could be simply mapped onto the subsumption 
relation, and that the synset notion could be mapped into the notion of concept³. Both 
subsumption and concept have the usual description logics semantics [12]. In order to work with named 
concepts, we normalized the way synsets are referred to lexemes in WordNet, thus obtaining 
one distinct name for each synset: if a synset had a unique noun phrase, this was used as 
concept name; if that noun phrase was polysemous, the concept name was numbered (e.g. 
window_1). If a synset had more than one synonymous noun phrase, the concept name linked 
them together with a dummy character (e.g. Equine$Equid). 
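The naming convention just described can be sketched as a small function. The capitalisation of each 
word, the underscore between the words of a noun phrase, and the numbering of polysemous names by 
order of appearance are assumptions made for this example; only the use of a numeric suffix and of a 
dummy joining character comes from the text. 

```python
from collections import Counter


def concept_names(synsets):
    """Return one concept name per synset (each synset is a list of noun phrases)."""

    def norm(noun_phrase):
        # "Equus caballus" -> "Equus_Caballus"
        return "_".join(word.capitalize() for word in noun_phrase.split())

    # A noun phrase is treated as polysemous if it occurs in more than one synset.
    counts = Counter(np for synset in synsets for np in synset)
    seen = Counter()
    names = []
    for synset in synsets:
        if len(synset) == 1:
            np = synset[0]
            if counts[np] > 1:
                seen[np] += 1
                names.append("%s_%d" % (norm(np), seen[np]))
            else:
                names.append(norm(np))
        else:
            # Several synonyms: link them with the dummy character "$".
            names.append("$".join(norm(np) for np in synset))
    return names
```

For example, the synsets [horse, Equus caballus], [window], [window] and [levitation] would be named 
Horse$Equus_Caballus, Window_1, Window_2 and Levitation. 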

Firstly, we created a Loom⁴ knowledge base, containing, for each named concept, its direct 
super-concept(s), some annotations describing the quasi-synonyms, the gloss and the synset 
topic partition, and its original numeric identifier in WordNet; for example: 

(defconcept Horse$Equus_Caballus 
  :is-primitive Equine$Equid 
  :annotations ((topic animals) 
                (word |horse|) 
                (word |Equus caballus|) 
                (documentation "solid-hoofed herbivorous quadruped domesticated since prehistoric times")) 
  :identifier |101875414|) 


Table 1 - Elements processed in the Loom WordNet kb 

noun entries                                                        116364 
equivalence classes: synonyms, spelling variants, quasi-synonyms     50337 
noun synsets (with a gloss and an identifier for each one)           66027 
nouns                                                                95135 
monosemous nouns                                                     82568 
polysemous nouns                                                     12567 
one-word nouns                                                       70108 
noun phrases                                                         25027 


2 We used the Prolog WordNet database, the Grind database, and some others from the official distribution. 

3 We will show that this assumption is incorrect, and to maintain it an adaptation of WordNet's synset organization is required. 

4 Loom is a knowledge representation system that implements a quite expressive description logic [9]. 





The elements processed in the Loom WordNet knowledge base are reported in Table 1. We 
report in Figure 1 an overview of WordNet's noun top-level as translated in our Loom 
knowledge base. The nine Unique Beginners are shown in boldface. 

3.2 Main problems found 

Once the Loom WordNet was created, we systematically applied the OntoClean methodology 
to the upper taxonomy of noun senses. Let us discuss now the main ontological drawbacks 
we found after applying this cleaning process. 

3.2.1 Confusion between concepts and individuals 

The first critical point was the confusion between concepts and individuals. For instance, if 
we look at the hyponyms of the Unique Beginner Event, we'll find the synset Fall - an 
individual - whose gloss is "the lapse of mankind into sinfulness because of the sin of Adam and 
Eve", together with conceptual hyponyms such as Social_Event, Happening, and Miracle. 
Under Territorial_Dominion we find Macao and Palestine together with Trust_Territory. The 
latter synset, defined as "a dependent country, administered by a country under the 
supervision of United Nations", denotes a general kind of country, rather than a specific country as 
those preceding it. We found many other examples of this sort. 

We face here a general problem: the concept/individual confusion is nothing but the product 
of an "expressivity lack". In fact, if there was an INSTANCE-OF relation, we could distinguish 
between a concept-to-concept relation (subsumption) and an individual-to-concept one 
(instantiation). Taking the previous example, we could therefore say that Palestine is an instance 
of Territorial_Dominion, while Trust_Territory is subsumed by it. 
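To make the distinction concrete, the following sketch keeps subsumption and instantiation as two 
separate relations instead of one hyponymy link. The Taxonomy class and its method names are 
hypothetical and serve only to illustrate the point; they do not reflect the Loom knowledge base. 

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    # concept -> set of super-concepts (concept-to-concept: subsumption)
    subsumed_by: dict = field(default_factory=dict)
    # individual -> set of concepts (individual-to-concept: instantiation)
    instance_of: dict = field(default_factory=dict)

    def add_subsumption(self, concept, parent):
        self.subsumed_by.setdefault(concept, set()).add(parent)

    def add_instance(self, individual, concept):
        self.instance_of.setdefault(individual, set()).add(concept)

    def is_individual(self, name):
        return name in self.instance_of
```

With the example above, Palestine would be added with add_instance and Trust_Territory with 
add_subsumption, both pointing at Territorial_Dominion, so that the two kinds of link are no longer 
confused. 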

3.2.2 Confusion between object-level and meta-level: the case of Abstraction 

The synset Abstraction_1 seems to include both abstract entities, such as Set, Time, and 
Space, and abstractions (meta-level concepts) such as Attribute, Relation, Quantity. From the 
corresponding gloss, an abstraction "is a general concept formed by extracting common 
features from specific examples" (New York skyscrapers, Solar System planets, Italian 
ministers). Abstraction seems therefore a psychological process of generalization, in 
accordance with Locke's notion of Abstraction ([8], p.211). This meaning seems to fit the latter group 
of terms (Attribute, Relation, Quantity), but not the former. Abstract entities are so because 
they are not extended in space/time, which corresponds to a very different meaning of “abstract”. 
Moreover, it is quite natural to consider attributes, relations, and quantities as meta-level 
concepts, while Set, Time, and Space seem to belong to the object domain. 

3.2.3 Formal properties violations in subsumption relation 

Moving now to the field of meta-level categories, the most common violation we have 
registered is about rigidity, which is bound to the distinction between roles and types. A role 
cannot subsume a type. Let's see an important clarifying example. 

In its first sense, Person (which we consider as a type) is subsumed by two different concepts, 
Organism and Causal_Agent. Organism can be conceived as a type, while Causal_Agent as a 
formal role. The first subsumption relationship is feasible, while the second one shows a 
rigidity violation. We propose therefore to drop it. 
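The check applied in this example, that an anti-F property cannot subsume an F property, can be 
sketched as a scan over direct subsumption links. The data layout (a dictionary of direct 
super-concepts and a dictionary of meta-property tags) and the function name find_violations are 
assumptions for this illustration; only direct links are inspected. 

```python
# Each anti-property and the property it must not subsume.
ANTI = {"~R": "+R", "~U": "+U", "~E": "+E"}


def find_violations(parents, meta):
    """Return (super, sub, anti_tag, tag) for every violating subsumption link.

    parents: dict mapping a concept to the list of its direct super-concepts
    meta:    dict mapping a concept to its set of meta-property tags
    """
    violations = []
    for child, supers in parents.items():
        for sup in supers:
            for anti, positive in ANTI.items():
                if anti in meta.get(sup, set()) and positive in meta.get(child, set()):
                    violations.append((sup, child, anti, positive))
    return violations
```

Tagging Person and Organism as +R and Causal_Agent as ~R, the only link reported is the one from 
Causal_Agent to Person, which is the link proposed for removal above. 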


5 Note that the sense numbering reported in our Loom kb is different from WordNet’s original one. Nevertheless, the reader will easily 
recognize the synsets we are referring to. 

6 In the text body, we usually do not report all the synonyms of a synset (or their numbering), but only the most significant ones. 





Abstraction_1: Attribute, Color, Chromatic_Color, Measure$Quantity$Amount$Quantum, Relation_1, 
    Set_5, Space_1, Time_1 

Act$Human_Action$Human_Activity: Action_1, Activity_1, Forfeit$Forfeiture$Sacrifice 

Entity$Something: Anticipation, Causal_Agent$Cause$Causal_Agency, Cell_1, Inessential$Nonessential, 
    Life_Form$Organism$Being$..., Object$Physical_Object, Artifact$Artefact, Edge_3, Skin_4, 
    Opening_3, Excavation$..., Building_Material, Mass_5, Cement_2, Bricks_and_Mortar, 
    Lath_and_Plaster, Body_Of_Water$Water, Land$Dry_Land$Earth$..., Location, Natural_Object, 
    Blackbody$Full_Radiator, Body_5, Universe$Existence$Nature$..., Paring$Parings, Film, 
    Part$Portion, Body_Part, Substance$Matter, Body_Substance, Chemical_Element, Food$Nutrient, 
    Part$Piece, Subject$Content$Depicted_Object 

Event_1: Fall_3, Happening$Occurrence$Natural_Event, Case$Instance, Time$Clip, Might-Have-Been 

Group$Grouping: Arrangement_2, Biological_Group, Citizenry$People 

Phenomenon_1: Consequence$Effect$Outcome..., Levitation, Luck$Fortune 

Possession_1: Asset, Liability$Financial_Obligation$..., Territory$Dominion$..., 
    Transferred_Property$... 

Psychological_Feature: Cognition$Knowledge, Structure, Feeling_1, Motivation$Motive$Need 

State_1: Action$Activity$Activeness, Being$Beingness$Existence, Condition, 
    Damnation$Eternal_Damnation 

Figure 1. WordNet's Top Level. 


Someone could argue that every person is necessarily a causal agent, since "agentivity" 
(capability of performing actions) is an essential property of human beings. In this case 
Causal_Agent would be intended as a synonym of "intentional agent", and would be considered as 
rigid. But in this case it would have only hyponyms denoting things that are (essentially) 
causal agents, including animals, spiritual beings, the personified Fate, and so on. 
Unfortunately, this is not what happens in WordNet: Agent, one of Causal_Agent's hyponyms, is 
defined as: "an active and efficient cause; capable of producing a certain effect; (the research 
uncovered new disease agents)". Causal_Agent subsumes roles such as Germicide, 
Vasoconstrictor, Antifungal. Instances of these concepts are not causal agents essentially. This means 
that considering Causal_Agent as rigid would introduce further inconsistencies. 

These considerations allow us to add a pragmatic guideline to our methodological techniques: 
when deciding about the formal meta-property to attach to a certain concept, it is useful to 
look at all its children. 
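
As a minimal sketch of this guideline (an illustration only, not part of the experiment
reported here), the children of a concept can be scanned for clashes with the tag under
consideration. The check below uses the OntoClean constraint that an anti-rigid property
cannot subsume a rigid one. The class and function names are hypothetical, and the tags
given to Person and Germicide are assumptions made for the example.

```python
from dataclasses import dataclass, field

RIGID, ANTI_RIGID = "+R", "~R"


@dataclass
class Concept:
    name: str
    rigidity: str                      # RIGID or ANTI_RIGID (illustrative tags)
    children: list = field(default_factory=list)


def rigidity_violations(concept):
    """Collect (parent, child) name pairs where an anti-rigid concept subsumes a rigid one."""
    found = []
    for child in concept.children:
        if concept.rigidity == ANTI_RIGID and child.rigidity == RIGID:
            found.append((concept.name, child.name))
        found.extend(rigidity_violations(child))
    return found


# Assumed tags: Causal_Agent and Germicide as roles, Person as a rigid type.
germicide = Concept("Germicide", ANTI_RIGID)
person = Concept("Person", RIGID)
causal_agent = Concept("Causal_Agent", ANTI_RIGID, [germicide, person])

violations = rigidity_violations(causal_agent)
print(violations)
```

With these assumed tags the only clash reported is between Causal_Agent and Person, which
is the kind of evidence that inspecting all the children is meant to surface.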

3.2.4 Heterogeneous levels of generality 

Going down the different layers of WordNet's top level, we register a certain "heterogeneity"
in their intuitive level of generality. For example, among the hyponyms of Entity there are
types such as Physical_Object, and roles such as Subject. The latter is defined as "something
(a person or object or scene) selected by an artist or photographer for graphic representation",
and has no hyponyms (indeed, almost any entity can be an instance of Subject, but none is
necessarily a subject) 7.

For Animal (subsumed by Life_Form) this heterogeneity becomes clearer. Together with
classes such as Chordate, Larva, Fictional_Animal, etc., we find more specific concepts,
such as Work_Animal, Domestic_Animal, Mate_3, Captive, Prey, etc. We are induced to
consider the former as types and the latter as roles.

In short, intuitively some synsets sound too specific when compared to their siblings. Looking
at them from the formal point of view we are developing, we can pinpoint their "different
generality" by means of the distinction between types and roles.


4 The OntoClean preliminary top-level 

Let us now introduce the top-level categories that we have used for our experiment,
each one characterized by the meta-properties discussed above. They represent a first draft of
the top-level distinctions we plan to use in our OntoClean methodology. They have been chosen
in order to be as general and neutral as possible, although (unlike the formal
properties used to characterize them) they reflect our cognitive bias, aimed at capturing the
ontological categories lying behind natural language and human commonsense.

For the time being, we have also avoided imposing a strong taxonomic structure on these
categories, focusing on producing a more or less flat list of basic concepts. In the future, a
further restructuring will probably be needed (for instance to better account for the
continuants/occurrents distinction, which we have only marginally addressed here). However, the
nature of the categories described below would not change.

All categories listed below are considered to be rigid, as they are assumed to reflect essential
properties of their instances. This essentiality is the result of our commonsense
conceptualization: when an amount of matter is considered as forming an object, the decision to
consider this "objecthood" as an essential property is only the result of our conceptualization.


4.1 Aggregates (~D, ~U) 

The common trait of aggregates is that they are independent entities, and none of them is an
essential whole. This means that the corresponding property carries anti-unity (~U). We
consider two kinds of aggregates: Amounts of matter and Arbitrary collections. The latter can
also be called groups, or perhaps sets; we prefer however to use set for abstract entities, and
group does sometimes denote something with an intrinsic unity. Arbitrary collections are just
mere sums of wholes which are not themselves essential wholes (like the sum of a person's
nose and a computer keyboard). Amounts of matter are extensional (+E), in the sense that
they change their identity when they change some parts; arbitrary collections can be
considered as pseudo-extensional, in the sense that they change their identity when a member is
changed, while a change in the parts of a member may be allowed.
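
The difference between the two identity criteria can be sketched as follows, assuming (for
illustration only) that parts and members are represented by plain string identifiers. The
function names are hypothetical.

```python
def amount_of_matter(parts):
    """Extensional (+E): the identity is exactly the set of parts."""
    return frozenset(parts)


def arbitrary_collection(members):
    """Pseudo-extensional: the identity is the set of member identifiers,
    whatever parts those members currently have."""
    return frozenset(member["id"] for member in members)


# Changing one part yields a different amount of matter.
matter_a = amount_of_matter({"portion_1", "portion_2"})
matter_b = amount_of_matter({"portion_1", "portion_3"})
same_matter = matter_a == matter_b

# A member may change its parts without changing the collection...
nose = {"id": "nose", "parts": {"skin", "cartilage"}}
keyboard = {"id": "keyboard", "parts": {"key_a", "key_b"}}
before = arbitrary_collection([nose, keyboard])
keyboard["parts"].discard("key_b")
after_part_change = arbitrary_collection([nose, keyboard])

# ...but removing a member gives a different collection.
after_member_change = arbitrary_collection([nose])
```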

4.2 Objects (~D, *U) 

The main characteristic of objects is that all of them are independent essential wholes. This
does not mean that the corresponding property (being an object) carries +U, since there is no
common unity criterion for objects. Among objects, we distinguish between extensional bodies
and ordinary objects. Bodies are considered to be extensional (+E), while ordinary objects
are not (~E). This means that we assume that all ordinary objects can change some of their
parts while keeping their identity.


7 We can draw similar observations for relation_1 and set_5 with respect to abstraction_1, etc.





4.3 Events (+D, +E) 

Events are things that happen in time, i.e. temporal occurrences. They differ from objects in
the way their parts are present in time: when an object is present, all its proper parts are
present; when an event is present, some of its proper parts (e.g., the previous phases of that
event) may not be present. For instance, the piece of paper you are reading now is fully
present, while some temporal parts of your reading are not present any more. For us, this
distinction between objects and events is however just a result of our cognitive-linguistic bias,
and we do not make any metaphysical commitment regarding the "primacy" of the former over
the latter.

Events can have temporal parts or spatial parts. All parts of events are events themselves. So
objects cannot be parts of events; rather, they participate in events. Considering the temporal
extension of an event, like the execution of a symphony, a temporal part is for instance the
execution of its first movement. On the other hand, a spatial part of an event is what is happening
in a part of its spatial extension. For example, in a 100 meters race, the run of the participant
in the 1st lane is a spatial part of the race. The important point is that all parts of an event are
essential parts: if an event changed any of its parts, it would be a different event. Events
are therefore extensional (+E).

Our work on events is still in progress, so we are not currently in a position to
propose a stable taxonomy of events. However, according to recent work in linguistics and
in philosophy 8, we can divide events into states (e.g. "the air smelling of jasmine"), processes
(e.g. "snowing"), accomplishments (e.g. "the sunset") and punctual events (e.g. "the cable
snapping"). Together, processes and accomplishments can be considered as dynamic events,
in the sense that something changes during the interval in which they occur. The "test of the
Progressive" used by Terence Parsons [10] is a useful linguistic criterion for distinguishing
dynamic events from states (states are not expressible by the English progressive form), which
however needs to be characterized in ontological terms.

In addition, we can distinguish accomplishments from states and processes by considering the
"sub-interval property" introduced by M.J. Cresswell [1]. In the case of accomplishments, a
property holding for the whole event does not hold for all its proper parts ("the capsizing of a
boat is not made of boat capsizing"), while the contrary is true for states and processes (every
temporal part of my sitting here is still a sitting here). For this reason, we can say that states
and processes are homeomerous, while accomplishments are not. The test of the progressive
and the sub-interval property test are used only for non-atomic events: in this sense, these two
analytic tools do not hold, trivially, for punctual events, which are just atomic. Therefore,
punctual events have to be considered as disjoint from states and dynamic events.
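
The sub-interval property can be sketched as a test over a discrete time line. The integer
time grid, the toy predicates sitting and capsizing, and all function names below are
assumptions made for the example, not a formalization proposed here.

```python
def proper_subintervals(start, end):
    """All proper sub-intervals [s, e) of [start, end) on an integer time grid."""
    return [
        (s, e)
        for s in range(start, end)
        for e in range(s + 1, end + 1)
        if (s, e) != (start, end)
    ]


def is_homeomerous(description, start, end):
    """True if the description holds for the whole interval and for every proper part."""
    return description(start, end) and all(
        description(s, e) for s, e in proper_subintervals(start, end)
    )


SITTING_INSTANTS = set(range(0, 5))      # toy state: sitting at instants 0..4
CAPSIZE_START, CAPSIZE_END = 0, 5        # toy accomplishment: needs the full span


def sitting(s, e):
    return all(t in SITTING_INSTANTS for t in range(s, e))


def capsizing(s, e):
    # Only an interval covering the event from onset to culmination counts.
    return s <= CAPSIZE_START and e >= CAPSIZE_END


state_result = is_homeomerous(sitting, 0, 5)
accomplishment_result = is_homeomerous(capsizing, 0, 5)
atomic_parts = proper_subintervals(0, 1)  # an atomic interval has no proper parts
```

The empty list for the atomic interval mirrors the remark that the test applies only
trivially to punctual events.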

In future work we should take account of actions too, which involve the notion of agent,
namely an object that participates intentionally in an event.

4.4 Features (+D, -E, *U) 

Features are "parasitic entities", which exist only insofar as their host exists. Typical examples
of features are holes, bumps, boundaries, or stains. Features may be relevant parts of their host,
like a bump or an edge, or dependent regions like a hole in a piece of cheese, the underneath
of a table, the front of a house, which are not parts of their host. All features are essential
wholes, but no common unity criterion may exist for all of them. However, typical features
have a topological unity, as they are singular entities.


8 Especially Mourelatos' paper Events, Processes, and States, in which the author tries to sum up a general ontology of events taking into
account different approaches.



4.5 Qualities (+D, +E, +U) 

Qualities and properties are often considered as synonymous, but they are not. Take a
particular object, like a rose: depending on its nature (and our way of perceiving it) it will exhibit
some individual qualities, like a specific color, a size, a smell, etc. The way we classify these
qualities may depend on our conceptualization of them, which strongly depends on our
culture and perceptual capabilities. Properties like red, big, sweet are the result of classifying
each of these qualities with respect to a specific conceptual space 9 [5]: so the rose is red
because its color is located in a certain region in the conceptual space of colors. When we say that
"red is a color" we are talking of a region in this space; when we say "I like the color of this
rose" we are talking of an individual quality 10. We assume that individual qualities do not
change in time, while their location in a conceptual space can vary in time. For
instance, we usually say that the color of a chameleon can change in time: according to our
view, what changes in time is the value of the individual quality "color of the chameleon",
namely its location in the conceptual space of colors, not the individual quality itself, which
is always "the color of the chameleon". The main characteristic of qualities is therefore their
being located in conceptual spaces, whose topological structure depends on the quality being
considered.
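
The distinction between an individual quality and its location in a conceptual space can
be sketched as follows. The one-dimensional "hue" space, the region boundaries and all
names are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

# Assumed regions of a one-dimensional hue space, as half-open intervals in degrees.
COLOR_REGIONS = {"red": (0, 30), "green": (90, 150)}


def region_of(value, regions):
    """Name of the region containing the value, or None if no region does."""
    for name, (low, high) in regions.items():
        if low <= value < high:
            return name
    return None


@dataclass
class IndividualQuality:
    bearer: str
    kind: str
    location_at: dict = field(default_factory=dict)  # time -> point in the space

    def property_at(self, time, regions):
        """The property (region) that classifies this quality at the given time."""
        return region_of(self.location_at[time], regions)


# One and the same individual quality, located at different points over time.
chameleon_color = IndividualQuality("chameleon", "color", {1: 10, 2: 120})
first = chameleon_color.property_at(1, COLOR_REGIONS)
second = chameleon_color.property_at(2, COLOR_REGIONS)
```

The object chameleon_color stays the same throughout; only the region returned for each
time differs.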

Speaking of their meta-properties, qualities are dependent entities. We also assume that they 
have no proper parts, so that they are trivially extensional and are trivially wholes. Finally, 
we assume that qualities of objects are physically located where the objects are located, so 
that qualities of concrete objects are themselves concrete. 

An important remark is that we take spatial and temporal locations of objects as individual 
qualities, too. This means that geometric space and time are considered as conceptual spaces. 

4.6 Abstractions (~C) 

Abstractions are entities that are not concrete, that is, they do not have a physical location.
Conceptual spaces are the first example of abstractions: time, geometric space, length, color
are all conceptual spaces, with different topological structure. Terms like red, long, old
correspond to regions in a conceptual space (and therefore to particulars, not universals). We can
therefore describe the structure of a conceptual space with a first-order theory, using
topological notions: for instance, we can say that red is adjacent to brown. Depending on the way
a conceptual space is partitioned, and on the metric (if any) imposed on it, we can have
different equivalent ways of describing a certain quality.

Other examples of abstractions are sets, symbols, propositions, structures. 

5 WordNet Cleaned up: mapping WordNet into the OntoClean top-level 
Let us consider now the results of integrating the WordNet top concepts into our top-level.
According to the OntoClean methodology, we have concentrated first on the so-called
backbone taxonomy, which only includes the rigid properties. Formal and material roles have
therefore been excluded from this preliminary work.

Comparing WordNet's unique beginners with our ontological categories, it becomes evident
that some notions are very heterogeneous: for example, Entity looks like a "catch-all" class
containing concepts hardly classifiable elsewhere, like Anticipation, Imaginary_Place,
Inessential, etc. Such synsets have only a few children and these have already been excluded in
our analysis.


9 This classification deals mainly with adjectives. This paper focuses on the WordNet database of nouns; nevertheless, our treatment of
qualities foreshadows a semantic organization of the database of adjectives too, which is a current desideratum in the WordNet community
(see [1], p. 66).

10 Note that WordNet does not distinguish between the two senses of "color" in this case.





The results of our integration work are sketched in Table 2. Our categories are reported in the
first column; the second column shows the WordNet synsets that are covered by such
categories (i.e., they are either equivalent to or included by them); the third column shows some
hyponyms of these synsets that were rejected according to our methodology. Finally, the last
column shows further hyponyms that have been appended under our categories, coming from
different places in WordNet. The problems encountered for each category are discussed
below.

5.1 Aggregate, Object, Feature 

entity$something is a very confused synset. As sketched in the table, a lot of its hyponyms
have to be "rejected": in fact there are roles (Causal_Agent, Subject_4), unclear synsets
(Location 11) and so on. This Unique Beginner maps partly to our Aggregate and partly to our
Object category. Some hyponyms of Physical_Object are mapped to our new top concept
Feature.

By removing roles like Arrangement and Straggle, group$grouping becomes a partition of
the Ordinary Object category (namely the third child of the top concept Object).

In fact, hyponyms like Collection, Social_Group, Biological_Group, etc., are nothing but
plural objects, supporting a clear unity criterion.

POSSESSION_1 is a role, and it includes both roles and types. In our opinion, the synsets
marked as types (Asset, Liability, etc.) should be moved towards lower levels of the
ontology, since their meanings seem to deal more with a specific domain (the economic one) than
with a set of general concepts (except some concepts that can be mapped to Abstraction).
This means that the remaining branch is also to be eliminated from the top level, because of
its overall anti-rigidity (the peculiarity of roles).

5.2 Abstraction, Quality 

ABSTRACTION_1 is the most heterogeneous Unique Beginner. It contains abstracts (Set_5),
quality spaces (Chromatic_Color), qualities (mostly from the synset Attribute) and a hybrid
concept (Relation_1) that contains abstracts, other entities, and even meta-level categories.
Each child synset has been mapped appropriately. As we can see from the table, our
Abstraction top concept relates to its WordNet homonym only with regard to a few hyponyms.

PSYCHOLOGICAL_FEATURE contains both abstract entities (Cognition) and events (Feeling_1),
and its children synsets have been mapped accordingly (we have created a new
Cognitive_Event concept to support such synsets under Event).

5.3 Event 

event_1, phenomenon_1, state_1, act are globally mapped to our Event branch, although
(by simply looking at their children) it seems quite hard to make explicit any criteria to maintain
the original distinctions. A comprehensive analysis of lower taxonomic levels is needed,
based on the approach we sketched in §4.3, which applies such criteria as durativeness,
dynamicity, intentionality, completion, etc., and also on a deep analysis of recent top ontologies
such as those developed for the SIMPLE and EuroWordNet projects. A formal characterization of
these criteria is ongoing work.
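
The mapping discussed in this section can be summarized as a lookup table. The dictionary
below is a reading of the prose above, not a reproduction of Table 2; its name and the
helper function are hypothetical, and finer-grained moves (for instance the hyponyms of
Physical_Object that go to Feature) are deliberately left out.

```python
# Unique Beginner -> OntoClean top-level categories it is (partly) mapped to.
TOP_LEVEL_MAPPING = {
    "entity$something": ["Aggregate", "Object"],
    "group$grouping": ["Object"],            # as a partition of Ordinary Object
    "possession_1": [],                      # a role: removed from the top level
    "abstraction_1": ["Abstraction", "Quality"],
    "psychological_feature": ["Abstraction", "Event"],
    "event_1": ["Event"],
    "phenomenon_1": ["Event"],
    "state_1": ["Event"],
    "act": ["Event"],
}


def beginners_mapped_to(category):
    """Unique Beginners with at least part of their content under the category."""
    return [name for name, targets in TOP_LEVEL_MAPPING.items() if category in targets]


event_sources = beginners_mapped_to("Event")
unmapped = [name for name, targets in TOP_LEVEL_MAPPING.items() if not targets]
```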

Conclusions

The final results of our integration effort are sketched in Figure 2. This is a preliminary
restructuring of WordNet's top level by means of the OntoClean methodology. Our results


11 Referring to Location, we find roles (There, Here, Home, Base, Whereabouts), instances (Earth), and geometric concepts like Line,
Point, etc.





show that a serious ontological improvement is needed in order to thoroughly exploit the
semantic richness of WordNet. For this purpose, exploiting the ontological constraints resulting
from the OntoClean methodology provides a good guideline. At this point, it is not clear how
to implement this methodology to get experimental data in a more comprehensive analysis of
big taxonomies. We think that a practical application of this methodology will help us to
enrich our characterization of the meta-properties and also to isolate new basic distinctions. We
welcome any development of the OntoClean methodology aiming to provide well-documented
procedures for applications to real systems.

Our research is still in progress: we hope we have paved the way for future work and possible 
cooperation. 

Acknowledgements

We would like to thank Claudio Masolo, Chris Partridge, Domenico Pisanelli, and Geri Steve
for the fruitful discussions and comments on the earlier version of this paper. This work was
partly supported by the Eureka Project IKF (E!2235, Information and Knowledge Fusion),
and the National project TICCA (Tecnologie Cognitive per l'Interazione e la Cooperazione
con Agenti Artificiali).



Covered Synsets

Substance$Matter*
Mass_5, Cement_2, Substance, ...

Object
Entity$Something*
Anticipation, Causal_Agent, Imaginary_Place, Substance

Extensional_Body
Natural_Object*
Dead_Body, Constellation, Stone, Nest, ...

Feature

Relevant_Part
part$portion*

Dependent_Region
Opening_3, Excavation$Hole_in_the_Ground, ...

Quality
chromatic_color_2

Abstract_Entity
Statement_1,
Cognition, Arrangement_2, Ownership_1, ...
Proposition_1
set_5

Quality_Space
Attribute*
space_1

Time
time_interval$interval*
Eternity, Greenwich_Mean_Time, Present, Past, Future

Color
chromatic_color_1

Event
PHENOMENON_1*, STATE_1*, Cognitive_Event, EVENT_1*, ACT*


TABLE 2 LEGEND: Hyponyms marked with "*" are heterogeneous (some of them are to be moved elsewhere,
some are roles, or some are instances); those marked with "!" have no hyponyms; those in upper case are
WordNet Unique Beginners; those in italic are Top Concepts of the OntoClean ontology.


References 

[1] Cresswell, M. J. 1996. Why Objects Exist but Events Occur. In R. Casati and A. Varzi (eds.),
Events. Dartmouth Publishing Company, Aldershot: 449-456.

[2] Gangemi, A., Guarino, N., Masolo, C., and Oltramari, A. 2001. Understanding top-level
ontological distinctions. In Proceedings of IJCAI-01 Workshop on Ontologies and Information
Sharing. Seattle, USA, AAAI Press.

[3] Gangemi, A., Guarino, N., and Oltramari, A. 2001. Conceptual Analysis of Lexical Taxonomies:
The Case of WordNet Top-Level. In C. Welty and B. Smith (eds.), Formal Ontology in
Information Systems. Proceedings of FOIS 2001. ACM Press: 285-296.

[4] Gangemi, A., Pisanelli, D., and Steve, G. 1999. An overview of the ONIONS Project: Applying
Ontologies to the Integration of Medical Terminologies. Data and Knowledge Engineering, 31.

[5] Gärdenfors, P. 2000. Conceptual Spaces: the Geometry of Thought. MIT Press, Cambridge,
Massachusetts.

[6] Guarino, N. and Welty, C. (eds.) 2000. A Formal Ontology of Properties. Proceedings of the
ECAI-00 Workshop on Applications of Ontologies and Problem-Solving Methods, Berlin,
Germany.

[7] Guarino, N. and Welty, C. 2001. Identity and subsumption. In R. Green, C. Bean and S. Myaeng
(eds.), The Semantics of Relationships: an Interdisciplinary Perspective. Kluwer (in press).

[8] Lowe, E. J. 1998. The possibility of metaphysics. Clarendon Press, Oxford.

[9] MacGregor, R. M. 1991. Using a Description Classifier to Enhance Deductive Inference. In
Proceedings of Seventh IEEE Conference on AI Applications: 141-147.

[10] Parsons, T. 1996. The Progressive in English: Events, States and Processes. In R. Casati and A.
Varzi (eds.), Events. Dartmouth Publishing Company, Aldershot: 20-47.

[11] Welty, C. and Guarino, N. 2001. Evaluating Ontological Decisions with OntoClean.
Communications of the ACM.

[12] Woods, W. A. and Schmolze, J. G. 1992. The KL-ONE family. In F. W. Lehmann (ed.),
Semantic Networks in Artificial Intelligence. Pergamon Press, Oxford: 133-177.

LADSEB-CNR and ITBM-CNR publications are retrievable on-line at the following URLs: 

http://www.ladseb.pd.cnr.it/infor/ontology/ontology.html

http://saussure.irmkant.rm.cnr.it


Aldo Gangemi, ITBM-CNR, Rome, Italy, aldo@saussure.irmkant.rm.cnr.it
Stefano Borgo, Indiana University, Bloomington IN, USA, stborgo@indiana.edu





Aggregate
    Amount of matter
        body_substance
        chemical_element
        mixture
        compound$chemical_compound
        mass_5
        fluid_1
    Arbitrary collection
Object
    Extensional Body
        blackbody$full_radiator
        body_5
        universe$existence$nature$creation
    Ordinary_Object
        collection$aggregation
        biological_group
        social_group
        kingdom
        geographical_object
        Body_Of_Water$Water
        Land$Dry_Land$Earth$...
        body$organic_structure
        artifact$artefact*
        Life_Form$Organism$Being$...
Feature
    Relevant_Part
        edge_3
        skin_4
        paring$parings
    Dependent_Region
        opening_3
        excavation$hole_in_the_ground
        ...
Quality
    position$place
    time_interval$interval*
    chromatic_color
Abstraction
    Abstract_Entity
        cognition$knowledge
        structure
        statement_1
        proposition
        symbol
        set_5
    Quality_Space
        space_1
        time_1
        time_interval$interval*
        chromatic_color
Event
    Event_1
    Phenomenon_1
    State_1
    Act$Human_Activity
    Cognitive_event

Figure 2. WordNet cleaned up: mapping WordNet into the OntoClean top-level 






Chinese Characters and Top Ontology in EuroWordNet 

Shun Ha Sylvia Wong 
Karel Pala 


Abstract 

In this paper we continue the work by Wong & Pala (2001) and compare 
a selected collection of Chinese radicals, Chinese characters and their meanings 
with Top Ontology (TO) developed in the framework of EuroWordNet 1, 2 project 
(EWN). The main attention is paid to the exploration of the way(s) concepts are 
organized in the Chinese language and how such organization differs from and is 
similar to that in EWN TO. The two main issues examined are: how the basic
concepts in the EWN 3rdOrderEntities are represented in Chinese, and the domains
in which concepts are grouped under each Chinese radical. The result of the
present study sheds light on how to improve the organization of basic concepts in 
existing ontologies. We discuss what potential implications this organization may 
have on the future development of EWN. 

1 Introduction 

Amongst the recent developments in natural language processing (NLP), the problems 
related to the organization of lexical meanings for knowledge representation have 
received a considerable amount of attention. This facilitated the development of 
formalized lexical databases such as WordNet 1.5 (Miller et al. 1990), CyC (Lenat 
& Guha 1990), HowNet (Dong & Dong 1999) and EuroWordNet 1, 2 (EWN) 
(Vossen et al. 1999). These databases share one common feature - they all contain 
some hierarchy of language-independent concepts which reflects the important semantic 
distinctions of each concept. They differ in the way in which they organize such 
concepts within the hierarchy. While such formalized lexical databases are known to be
well-defined hierarchical systems, there remains doubt as to how faithfully and accurately
such artificial constructs model the complicated groupings of real-world concepts. There
is also doubt as to whether the arbitrariness of such artificial constructs would hinder
their effectiveness in knowledge representation. In order to investigate the seriousness
of the implication and effect of these doubts, we studied how concepts are organized
in the Chinese language and compared the result with EWN TO. This has been done
in the hope that new, improved approaches to knowledge representation can be derived as a
result of this study.

1.1 Organization of Concepts as Ontologies

In the field of NLP we can come across the artificial constructs called ontologies
that have been developed recently within projects like EuroWordNet 1, 2 (Vossen
et al. 1999), or systems of CyC type (Lenat & Guha 1990). These ontologies
were built with the primary purpose of serving as the lexical databases for knowledge
representation systems. In the current version of EWN TO (v.1), three types of entities
are distinguished at the first level of TO (Vossen et al. 1999):

• 1st Order — any concrete entity publicly perceivable by the senses and located at 
any point in time, in a three-dimensional space, e.g. individual persons, animals 
and more or less discrete physical objects and physical substances. They are 
always denoted by (concrete) nouns. 



• 2nd Order - any Static Situation (property, relation) or Dynamic Situation, which 
cannot be grasped, heard, seen, felt as an independent physical thing. They occur 
or take place rather than exist, e.g. continue, occur, apply, and also events, 
processes, states-of-affairs or situations that can be located in time belong here. 
They can be expressed by nouns, verbs and adjectives. 

• 3rd Order -- unobservable propositions which exist independently of time and 
space. They can be true or false rather than real. They can be asserted or denied, 
remembered or forgotten, e.g. ideas, thoughts, theories, plans, hypotheses, reasons, 
and they are always expressed by (abstract) nouns. 
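
The three orders and the parts of speech said above to express them can be sketched as a
small lookup. The enumeration and function names are hypothetical; only the
order-to-part-of-speech pairing is taken from the descriptions above.

```python
from enum import Enum


class EntityOrder(Enum):
    FIRST = "1stOrderEntity"
    SECOND = "2ndOrderEntity"
    THIRD = "3rdOrderEntity"


# Parts of speech that can denote entities of each order, per the list above.
ALLOWED_POS = {
    EntityOrder.FIRST: {"noun"},                        # concrete nouns
    EntityOrder.SECOND: {"noun", "verb", "adjective"},
    EntityOrder.THIRD: {"noun"},                        # abstract nouns
}


def can_express(order, pos):
    """True if a word of the given part of speech can denote an entity of that order."""
    return pos in ALLOWED_POS[order]


verb_for_second = can_express(EntityOrder.SECOND, "verb")
verb_for_third = can_express(EntityOrder.THIRD, "verb")
```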

The concepts in the 1stOrderEntities have been dealt with by Wong & Pala (2001). In
the present study, we focus on the 3rdOrderEntities.

1.2 Chinese Characters 

It is well known that Chinese script originated from picture-writing (i.e. representing
objects and concepts encountered in everyday life by pictures). However, not all
Chinese characters 1 are pictographs. In fact, only a couple of hundred of them are
really pictographs (Harbaugh 1996). According to the etymological dictionary written
by Xu Shen around 100 A.D. 2, Chinese characters can be divided into six groups
(Harbaugh 1996, Lu 1998, Baker 1998):

1. pictographs (≈ 4%): represent real-life objects by drawings, e.g. the ancient form
of 日 (sun) resembles ⊙ and 羊 is a pictograph of a sheep with horns,

2. ideographs (≈ 1%): represent positional and numeral concepts by indication,
e.g. 一 (one), 二 (two), 上 (up), 下 (down),

3. logical aggregates (≈ 13%): form a new meaning by combining the meanings
of two or more characters, e.g. 尖 (sharp) is formed from 小 (small) and 大 (big),
putting 日 (sun) and 月 (moon) together forms 明 (bright),

4. phonetic complexes (≈ 82%): form a character by combining the meaning of one
character and another character which links to the same sound, e.g. 蛾 (moth)
is formed by 虫 (insect) - the meaning component, and 我 (I/me) - the sound
component,

5. associative transformations (a small proportion): extend the meaning of a character
by adding more part(s) to the existing one, e.g. 庭 (emperor's court) was transformed
from 廷 (emperor's court, courtyard) by adding 广 (shelter, home/house),

6. borrowings (a small proportion): to borrow the written form of a character with
the same sound, e.g. 里 (mile) originally means a town/small district/village.
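
The structure of a phonetic complex can be sketched as a record with a meaning component
and a sound component. The class name, the gloss table and the helper are hypothetical,
and the characters 蛾, 虫 and 我 are this example's reading of the items glossed above as
moth, insect and I/me.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneticComplex:
    character: str
    meaning_component: str
    sound_component: str
    gloss: str


# Glosses of the simpler characters used as components (assumed for the example).
COMPONENT_GLOSS = {"虫": "insect", "我": "I/me"}

moth = PhoneticComplex(character="蛾", meaning_component="虫",
                       sound_component="我", gloss="moth")


def semantic_hint(entry):
    """The gloss of the meaning component, i.e. the domain the character points to."""
    return COMPONENT_GLOSS[entry.meaning_component]


hint = semantic_hint(moth)
```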

There are roughly 50,000 characters in the Chinese script, but an average
educated Chinese only knows about 6,000 characters (Harbaugh 1996). However,
this rather limited knowledge of the Chinese script does not necessarily hinder a Chinese
1 As the Chinese script evolved in time, the original pictures were simplified to 'written graphs'
(Lu 1998) comprising more or less straight lines. Thus, it is perhaps more precise to use the term
'graph' instead of 'character'.

2 In this dictionary, Xu Shen included only 9,353 characters. 





in acquiring the meanings of new characters. One reason for this is that many Chinese
characters (especially the more complicated ones) are formed by combining two or
more simpler characters and at least one of such components sheds some light on the
meaning of the resulting character. Thus, the knowledge of a few thousand characters
allows a Chinese to deduce the meaning of a previously unseen character 3. For instance,
when a Chinese encounters a set of unknown characters like: M, $f, M, $£, &J, 

knowing that 魚 means 'a fish', he/she can deduce that these characters probably mean
some kinds of fish (in fact, they are all names of fish).

Unlike many scripts which represent a word as strings of letters, Chinese characters
cannot be listed in a dictionary according to some alphabetical ordering of letters.
Instead of grouping characters by their first stroke (which is the smallest component of
a Chinese character), Chinese groups characters according to a collection of Section
Heads (literally: 部首), i.e. Chinese radicals. The primary aim of this grouping is to
allow a logical organization of Chinese characters within a dictionary. There are 213
radicals in Chinese.

Characters are not grouped under each Chinese radical arbitrarily. In most cases,
a character is grouped under a certain Chinese radical if its concept relates to the
concept represented by the Chinese radical in some way. Though a Chinese character
can contain more than one shape which is among the radicals, each character is
grouped under only one radical. It is also necessary to realize that most Chinese
radicals serve a similar function to individual words or meaningful morphemes in other
natural languages. Thus, to a certain extent, we can view them as lexemes.
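
The grouping of characters under radicals, and the kind of deduction it supports, can be
sketched as follows. The four characters and two radicals are illustrative choices for
this example, not data from the study, and the function names are hypothetical.

```python
from collections import defaultdict

# Gloss of each radical (illustrative subset).
RADICAL_GLOSS = {"魚": "fish", "虫": "insect"}

# Each character is filed under exactly one radical, so a plain mapping suffices.
CHARACTER_RADICAL = {"鯉": "魚", "鯊": "魚", "蛾": "虫", "蚊": "虫"}


def build_index(character_radical):
    """Dictionary-style index: radical -> characters grouped under it."""
    index = defaultdict(list)
    for character, radical in character_radical.items():
        index[radical].append(character)
    return dict(index)


def guess_domain(character):
    """Guess the conceptual domain of a character from the gloss of its radical."""
    radical = CHARACTER_RADICAL.get(character)
    return RADICAL_GLOSS.get(radical)


index = build_index(CHARACTER_RADICAL)
fish_guess = guess_domain("鯊")
unknown_guess = guess_domain("水")   # not in the illustrative table
```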

2 Chinese Data and EWN 3rdOrderEntities 

The concepts in the 3rdOrderEntity list are very abstract and, in a sense, fairly difficult 
to grasp. Most, if not all, concepts in this entity list display a propositional nature. 
A direct way to represent abstract ideas is to express them in the form of sentences. 
Wong & Pala (2001) have shown that no direct correspondence can be found between 
Chinese radicals and the concepts in the list. It appears that the Chinese counterparts 
of these concepts tend to be represented by more complicated Chinese characters. In 
order to understand how such abstract propositional concepts are represented in Chinese 
(in the hope that this could help to organize the concepts in the 3rdOrderEntity list 
more systematically), we studied the Chinese counterparts of these concepts. 

For each of the basic concepts (BCs) in the 3rdOrderEntities stated by Vossen et al. 
(1999), we looked up their Chinese counterparts from the Oxford Advanced Learner’s 
English-Chinese Dictionary (Hornby 1984). The list of Chinese words includes some 
synonyms, hyperonyms and/or meronyms of those basic concepts. Using two web-based 
Chinese dictionaries (Harbaugh 1998, Muller 2000), we then studied the meaning of 
the individual characters which form part of each word 4 . More Chinese words which 
correspond to the basic concepts in the 3rdOrderEntity list were found and analyzed. 
The result of this study is as follows: 

1. theory (3 senses in EWN): 

3 One would expect that this kind of deduction can be made more effective when the ‘new’ character 
is encountered in a context. 

4 While a Chinese character can be a word in its own right, most Chinese words are formed by
concatenating more than one Chinese character.





(opinion/theory/discussion)

(opinion/theory/discussion + theory/to explain/to say = reasoned suppositions put forward
to explain facts or events)

(study/theory/discipline + theory/to explain/to say = reasoned suppositions put forward to
explain facts or events)

(stable + logic/reason/theory = theorem)

(study/theory/discipline + logic/reason/theory = general principles of an art or science)

(origin + logic/reason/theory = principle)

(real + logic/reason/theory = truth)

(road/way/doctrine + logic/reason/theory =

(logic/reason/theory + opinion/theory/discussion = theory)

2. reason (5 senses in EWN): 

(logic/reason/theory + from = reason)

(logic/reason/theory + to separate/to solve/to relieve = comprehension)

(logic/reason/theory + nature/disposition/sex = rational)

(hem/cause/reason + cause/reason/previous = cause)

(origin + cause/reason/previous = reason)

(origin + reason/cause/because = reason)

3. hypothesis (2 senses in EWN):

(fake/pseudo/to borrow + to establish/to set up = assumption)

(theory/to explain/to say + law/method = way of speaking)

(breast/thought/one's inner mind + theory/to explain/to say = sth conjectured and not
necessarily based on reasoning)

(breast/thought/one's inner mind + to measure/to fathom/to predict = sth conjectured and
not necessarily based on reasoning)

(to push/to procrastinate + to measure/to fathom/to predict = conjecture)

(meaning/idea/intention + see/meet/opinion = opinion)

4. idea (4 senses in EWN)/thought (4 senses in EWN):

(master/main + meaning/idea/intention = idea)

(to observe/view + to remember/to study = idea)

(to observe/view + point = viewpoint)

(to think/to consider + law/method = view/opinion)

(construct/form + to think/to consider = idea/plan/scheme)

(to think/to contemplate + to think/to consider = thought/ideology)

(to think/to contemplate + tie = thinking/thought)

(eye/category + clear/accurate/of = objective/goal/purpose)

(meaning/idea/intention + see/meet/opinion = opinion)

(logic/reason/theory + to remember/to study = idea/concept)

5. structure (4 senses in EWN):

(construct/form + to make = structure)

(to tie/knot/to congeal + construct/form = structure)

(to build/to erect/to construct + to make + law/method = construction method)

(foundation/basis + foundation/base = foundation/base)

(pattern/standard + pattern/model/type = format/pattern/form)

6. evidence (4 senses in EWN): 


(evidence/to testify/to prove + to seize/according to = evidence)

(evidence/to testify/to prove + bright/clear/understand = proof)

(track/trace + elephant/image = sign/mark/indication)

(scar/mark + track/trace = trace/mark)

(shape + track/trace = trace)

7. doctrine (2 senses in EWN):

(teaching/to teach/religion + principle/morality/basic truth/meaning = doctrine)

(master/main + principle/morality/basic truth/meaning = doctrine/ideology)

(study/theory/discipline + theory/to explain/to say = study)

8. policy (3 senses in EWN):

(government/administration/politics + to urge/strategy = policy)

(square/a place/direction/side/method + needle = direction)

(square/a place/direction/side/method + legal case/record/plan = project/plan)

(law/method + legal case/record/plan = bill)

(law/method + system/to overpower/to control = legal system)

(law/method + law/rule = law)

(rules/regulations + law/rule = rule)

9. content (7 senses in EWN): 

(inside + to contain/countenance = content)

(to need/important + principle/morality/basic truth/meaning = main meaning)

(to become/to complete/to accomplish + to distribute/to distinguish =
composition/component/ingredient)

10. procedure (4 senses in EWN):

(procedure/journey + sequence = process/procedure)

11. concept (1 sense in EWN):

(to observe/view + to remember/to study = idea)

(to estimate/overall + to remember/to study = concept/idea)

(logic/reason/theory + to remember/to study = idea/concept)

12. plan/plan of action (3 senses in EWN):

(to calculate/plan/scheme + to draw/a drawing = plan)

(square/a place/direction/side/method + legal case/record/plan = project/plan)

(to establish/to set up + to calculate/plan/scheme = design)

13. communication (2 senses in EWN):

(to spread + to dye = infection)

(to spread + to hear/to smell = rumor)

(passable/to move unobstructed + information/to ask = communication)

14. knowledge base (1 sense in EWN):

(to know/knowledge + to know/knowledge = knowledge)

(to digest + breath/news = information)

(see/meet/opinion + to hear/to smell = experience)

15. cognitive content (1 sense in EWN):

(to admit/to recognize + to know/knowledge = cognition)

(to admit/to recognize + to know/knowledge = knowledge)

(study/theory/discipline + to know/knowledge = learning/knowledge)

(to observe/view + to sense/sense = impression)



16. know-how (1 sense in EWN): 

(skill/technique + ability = technique/mastery)

(skill/technique + method/technique = technique/technology)

(skill/technique + skillful = technique)

(square/a place/direction/side/method + law/method = method)

17. category (2 senses in EWN):

(to divide + category/kind = category)

(to cultivate/type + category/kind = type/kind)

18. information (3 senses in EWN)/data point (1 sense in EWN):

(capital/fund + to expect/material = information/data)

(capital/fund + information/to ask = information)

(feeling/love/situation + report = information/intelligence/report)

(information/to ask + breath/news = message)

(to vanish/to disappear + breath/news = information/news)

(report + to tell = report)

(to know/knowledge + to know/knowledge = knowledge)

19. abstract (info) (2 senses in EWN):

(big/large + rope for pulling net = outline)

(big/large + to need/important = abstract/main aim)

(to pick + to need/important = summary/abstract)

For each of the concepts in the 3rdOrderEntity list, Chinese has a vast collection of 
terms for representing it. The above data present the more popular subset of terms 
used nowadays. 

The displayed data show clearly how an abstract concept is formed in Chinese and 
how the meaning of the components (in terms of characters) of each word interact with 
each other to form the required meaning. Notice that the above Chinese words are not 
mainly made up of unique characters, but of a small subset of characters. Many of 
the characters in this subset could interact with each other to form a related or very 
different concept, e.g.: 

• 假設 (fake/pseudo/to borrow + to establish/to set up = assumption) versus
設計 (to establish/to set up + to calculate/plan/scheme = design), and

• 理論 (logic/reason/theory + opinion/theory/discussion = theory) versus
理由 (logic/reason/theory + from = reason) versus
理念 (logic/reason/theory + to remember/to study = idea/concept).

Figure 1 shows the Chinese characters forming words that can be grouped under 
different EWN basic concepts. 

Though this feature, namely sense transfer, holds across the collection of 3rdOrderEntity
basic concepts, it is more obvious among the Chinese words which are grouped
under the same basic concept. For instance, the characters 理 (logic/reason/theory), 論
(opinion/theory/discussion) and 說 (theory/to explain/to say) appear more often under theory,
the characters 想 (to think/to consider) and 思 (to think/to contemplate) show up more often
under idea/thought and the same for 念 (to remember/to study) under concept.
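
The grouping described above can be made concrete with a small sketch. It is only an illustration under
stated assumptions: the tuple layout, the names WORDS, index_by_character and shared_characters are
hypothetical, and the five entries are copied from the glosses in the listing, not from any EWN resource.

```python
from collections import defaultdict

# Hypothetical encoding (an assumption of this sketch): each entry holds the
# basic concept it is listed under, the glosses of its characters, and the
# gloss of the whole word, as given in the listing above.
WORDS = [
    ("theory", ("logic/reason/theory", "opinion/theory/discussion"), "theory"),
    ("reason", ("logic/reason/theory", "from"), "reason"),
    ("concept", ("logic/reason/theory", "to remember/to study"), "idea/concept"),
    ("hypothesis", ("fake/pseudo/to borrow", "to establish/to set up"), "assumption"),
    ("plan", ("to establish/to set up", "to calculate/plan/scheme"), "design"),
]


def index_by_character(words):
    """Map each character gloss to the basic concepts whose words contain it."""
    index = defaultdict(set)
    for basic_concept, characters, _meaning in words:
        for character in characters:
            index[character].add(basic_concept)
    return dict(index)


def shared_characters(words):
    """Keep only the characters that occur under more than one basic concept."""
    return {c: bcs for c, bcs in index_by_character(words).items() if len(bcs) > 1}


if __name__ == "__main__":
    for character, concepts in sorted(shared_characters(WORDS).items()):
        print(character, "->", sorted(concepts))
```

Characters returned by shared_characters are the ones that take part in sense transfer across basic
concepts in this toy sample; with the full listing the same index would correspond to the tick marks
of Figure 1.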

This phenomenon exposes an obvious inadequacy of existing ontologies like EWN
TO: though the top-down, hierarchical and logical organization of concepts allows the
retrieval of the foreign language counterpart(s) of a word, it lacks the dynamics which
allows one to combine related primary concepts to form secondary concepts. Thus, if
a term is not stated in the network, there would be no means to derive the meaning
of this term. However, in some cases, even in English and other European languages,
it is possible to derive a previously unseen term by looking at its morphemes, e.g. in
English: care-free, side-light, un-think-able and in Czech: uč-i-t-el (a root denoting the
concept 'teach' + a verb-making affix + an infinitive affix + an agentive suffix = teacher), etc.

[Figure 1: tick-mark matrix relating individual characters to the basic concepts under which their
words are grouped. Character glosses legible in the figure: theory/to explain/to say; to establish/to
set up; to know/knowledge; study/theory/discipline; logic/reason/theory; origin; law/method;
master/main; to observe/view; to remember/to study; breath/news; principle/morality/basic
truth/meaning; square/a place/direction/side/method; legal case/record/plan; to need/important;
to distribute/to distinguish; to hear/to smell.]

Figure 1: A diagrammatic view of Chinese characters used in different 3rdOrderEntity
basic concepts

When we look at the corresponding radicals of the characters 理, 論, 說, 想, 思 and 念
(i.e. the radical 玉 (jade, gem stone) for 理, the radical 言 (language/dialect,
speech) for 論/說, the radical 心 (heart) for 想/思/念 6 ), we can observe a very interesting
property, i.e. the realization of abstract ideas/concepts by means of concrete physical
objects. This parallels a much-studied NLP topic: metaphoric speech. Rather than
interpreting metaphors on a sentence level, it reflects metaphors on
a word-meaning/concept level. While work on metaphoric speech asks how it could be
interpreted appropriately to extract the richer intended meaning, this study suggests that
looking at the way ‘metaphors’ are used on a concept or word sense level might help
us to formalize how an abstract secondary concept could be deduced from a physical
primary concept, if such regularity does exist. In our opinion, this
line of thinking has to be followed in future research in order to obtain a better understanding
of the metaphorical processes.

Thanks to the ideographical nature of Chinese characters, the meaning transformation
in Chinese seems to be more visible. The same phenomena can be observed in other
languages as well. For instance, compare the English verb put and its Czech counterpart
položit: they both display a standard case of sense transfer.

          From                           To

English   put
          ‘put a book on the table’      ‘put stress on a question’

Czech     položit
          ‘položit knihu na stůl’        ‘položit důraz na tu otázku’
          put book on table              put stress on that question


However, we are aware that such phenomena can be rather difficult to trace in many 
natural languages. From the above data, it appears that the unique way of the evolution 
of Chinese script makes the study of such phenomena easier because they are more 
traceable in Chinese. 

3 The Chinese Way to Represent Concepts 

Wong & Pala (2001) have observed that Chinese seems to organize concepts in a 
contextual manner, with each Chinese radical serving as the characterizing basic concept 
in the respective context. To study how this organization works, we have picked
seven Chinese radicals which match the basic concepts in the EWN 1stOrder and
2ndOrder entities and studied the more commonly-known subset of their subsuming
characters (roughly 175 characters), as listed in (Wah Tung Committee 1983, Harbaugh 
1998, Muller 2000). The result of the observation shows that many of the characters 
subsumed in these radicals can be classified along five main lines (as illustrated in 
Figure 2). The following shows a reduced version of such classifications: 

• 1stOrderEntities -> Origin -> Natural -> Living -> Animal: 魚 (fish, to fish)

6 Chinese regarded the heart, rather than the brain, as the body part which performs thinking and 
experiences mental states. This assumption is, to some extent, shared by Czech as well. 


Figure 2: A new way for organizing concepts: a schema

- as an object 7: (shark), (carp), (salmon), (tuna), (flounder), (eel), (cod)

- as a property: (stupid), (fresh, tasty)

- as a typical event (situation, process): (fish bone stuck in one's throat)

- its component 8: (fin), (fish scale), (fish bladder), (gill)

• 1stOrderEntities -> Composition -> Part: 齒 (a tooth, the upper incisors)

- as an object: (crooked teeth), (3rd generation teeth), (decaying teeth)

- as a property: (uneven (teeth))

- as a typical event (situation, process): (to change teeth), (to bite), (to grind teeth)

- its component: (gum)

• 2ndOrderEntities -> SituationType -> Dynamic -> UnboundedEvent: 行 (to walk)

- as an object: (street), (avenue), (alley), (cross road, arterial road), (lane, alley)

- as a typical event (situation, process): (to flow, to overflow) 9, (to show off) 10,
(to charge, to rush), (to protect, to guard)

• 2ndOrderEntities -> SituationType -> Dynamic -> UnboundedEvent 11: 火 (fire)

- as an object: (stove, oven, furnace), (stove), (a brick bed warmable by fire),
(candle), (lamp), (cannon), (wick), (charcoal)

7 Note that ‘fish’ here is not restricted to the meaning used in the biological classification. The
radical also subsumes characters which mean some kind of marine creature, e.g.: (abalone),
(whale) and even (crocodile).

8 This line corresponds to how these concepts are captured by the meronymy/holonymy relation in EWN.

9 The character (to flow, to overflow) is composed of water and walks (Harbaugh 1998).

10 The character (to show off) is composed of to cover what is already obscure/tiny and to
walk/proceed (Harbaugh 1998).

11 (fire) also appears as 2ndOrderEntities -> SituationComponent -> Cause -> Phenomenal.



- as a property: (hot), (bright), (glorious), (overcooked), (bright, brilliant),
(dazzling), (/war)

- as a typical event (situation, process): (to roast), (to fry), (to cook), (to cremate),
(to solder), (to extinguish), (to burn)

- as a consequence: (ash, grey), (smoke), (charred), 0*^0


When we look closely at the concepts in each of the above five classes and their
corresponding Chinese radical (e.g. 行), we can see that the basic concept (i.e. to walk)
represented by the Chinese radical projects itself along one of the main lines shown
in Figure 2 (e.g. as a physical object) and transforms itself into a related concept
(e.g. street, avenue, alley, etc.). For instance, through realizing ‘fire’ in the form of
an object, we get (stove, oven, furnace), (candle) and even (charcoal); when we
think about ‘fire’ as the vital participant of a typical event (situation, process), we get
(to roast), (to cremate) and even (to extinguish) 12 . While the above classification
suggests a natural way to organize concepts, it also gives us an insight into how we
can relate a concept in the 1stOrderEntity list to another concept in the 2ndOrderEntity
list and vice versa.

The presented Chinese data allow us to draw the following conclusions about the 
organization of lexical/conceptual knowledge. Such knowledge can be centered around 
the relevant concepts in the form of a small semantic network. In general, such 
structure may look like Figure 2. Having the schema shown in Figure 2, one can form 
an instance of it with the particular concept, e.g. ‘fire’. Notice that this schema need 
not always result in a tree structure, but, in typical cases, it would rather result in a 
semantic network when the concepts in each instance of the schema interact with each 
other in a more complicated way. We are aware that the realization of this schema 
calls for further explorations. The organization of conceptual knowledge in this way is,
when compared with EWN TO, certainly richer, more complete and more transparent,
as the organization of concepts is not centered around the syntactic categories in which
the concepts are verbalized, but around the semantic context from which each concept is
derived. In our opinion, it better reflects the ways in which humans organize and process
conceptual knowledge.
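
An instance of the schema in Figure 2 can be sketched as a small record type. This is a minimal
illustration, not an implementation from the study: the class name ConceptSchema, its field names and
the method line_of are hypothetical, and the entries for ‘fire’ are a subset of the glosses listed above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ConceptSchema:
    """A basic concept and the related concepts found along each of the five lines."""

    concept: str
    as_object: list = field(default_factory=list)
    as_property: list = field(default_factory=list)
    as_event: list = field(default_factory=list)
    component: list = field(default_factory=list)
    consequence: list = field(default_factory=list)

    def line_of(self, gloss):
        """Return the name of the line under which `gloss` is listed, or None."""
        for f in fields(self):
            if f.name != "concept" and gloss in getattr(self, f.name):
                return f.name
        return None


# Instance for 'fire'; the source lists no component for this radical,
# so that line stays empty.
FIRE = ConceptSchema(
    concept="fire",
    as_object=["stove", "candle", "lamp", "charcoal"],
    as_property=["hot", "bright"],
    as_event=["to roast", "to cremate", "to extinguish"],
    consequence=["smoke", "charred"],
)

if __name__ == "__main__":
    print(FIRE.line_of("charcoal"))
    print(FIRE.line_of("to roast"))
```

Plain lists keep the sketch tree-shaped; linking entries of one instance to other instances would turn
it into the semantic network mentioned in the text.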

The presented analysis suggests that existing ontologies (e.g. EWN TO) could be
enriched in the illustrated way and organized in a more structured manner while
keeping some of their existing features, e.g. relations of synonymy/antonymy and
hyponymy/hyperonymy. It can be seen that new semantic relations may be systematically
added which also enable the formulation of richer inference rules. That should
considerably improve the power and applicability of the existing knowledge representation
resources (databases, lexicons, etc.).

We can argue that the relevant evidence comes from the fact that Chinese radicals 
play a specific role among Chinese characters. Their basic nature supports our hypothesis 
about the organization of conceptual knowledge. In this respect, we think, it is not an 
issue that can be confirmed by more data. The essential fact is that Chinese radicals 
allow for such natural and consistent interpretation. The presented comparison with the 
constructs that exist in EWN TO also points to this direction. 

At this stage of the research we are aware of the fact that we do not know enough
about the organization of concepts in human brains because the relevant psychological
evidence cannot be accessed readily. What we rely on, however, are the means offered
by natural languages (Chinese in particular) which help us to understand how we
shape our perception of the real world. Understanding how concepts are organized
through language expressions gives us quite firm grounds on which we are able to
approximate how our minds organize concepts naturally.

12 Notice that, as each Chinese radical corresponds to a unique basic concept, not all of them can be
realized in every one of the above five contexts. Thus, in the above Chinese data, some Chinese radicals
do not have a correspondence for certain classes.

4 Conclusion 

Continuing the research started by Wong & Pala (2001), we studied the organization of 
concepts into ontologies. The ontology we selected as an example is Top Ontology (TO) 
as defined and used within the EuroWordNet (EWN) 1, 2 project. We have compared 
the organization of concepts in EWN TO with the ways concepts are organized in a 
natural language, particularly Chinese. 

This time we explored the relations between 2nd- and 3rd-OrderEntities as defined
in TO with the corresponding Chinese radicals and characters. The results of the
comparison appear to be both interesting and promising: Chinese data offer some new
views of concept organization that, as we hope, could systematically enrich the existing
EWN TO and make it more natural and structured. As we have suggested, the original
atomic concepts in the present TO could be viewed as graph structures for capturing
the relations (or associations) between concepts in a way that, we hope, is more similar
to how a human brain processes conceptual knowledge. This opens a door for
a better formulation of basic inference rules, which would make more realistic and
intelligent reasoning possible.

References 

Baker, M. A. (1998), ‘Mandarin Chinese outpost — characters’, [Online]. Available at: 
http://chinese_outpost.tripod.com/chars.html [2001, March 21]. 

Dong, Z. & Dong, Q. (1999), ‘HowNet’, [Online] Available at: 
http://www.keenage.com/zhiwang/e_zhiwang.html [2001, June 7]. 

Harbaugh, R. (1996), ‘Zhongwen.com - Chinese Characters and Culture’, [Online]. 
Available at: http://www.zhongwen.com/ [2001, March 19]. 

Harbaugh, R., ed. (1998), Chinese Characters: A Genealogy and Dictionary, Han Lu, 
Taipei. Also appeared in: [Online], http://www.zhongwen.com/rn/search.htm [2001, 
March 19]. 

Hornby, A. S., ed. (1984), Oxford advanced learner’s English-Chinese dictionary , 
Oxford University Press; Keys Publishing, Hong Kong. 

Lenat, D. & Guha, R. (1990), Building Large Knowledge-based Systems: Representation
and Inference in the Cyc Project, Addison Wesley.

Lu, A. Y.-C. (1998), Phonetic Motivation -- A Study of the Relationship between Form 
and Meaning, PhD thesis, Department of Philology, Ruhr University, Bochum. 

Miller, G. A. et al. (1990), Five papers on WordNet, Technical report, Princeton 
University. CSL Report 43, Cognitive Science Laboratory. 


Muller, A. C. (2000), ‘Dictionary of East Asian literary terms’, [Online]. Available at:
http://www.human.toyogakuen-u.ac.jp/~acmuller/dicts/dealt/index.htm [2001, March 19].

Peterson, E. (2000), ‘Chinese character dictionary’, [Online]. Available at: 
http://www.mandarintools.com/chardict_rs.html [2001, March 19]. 

Vossen, P. et al. (1998), The EWN base concepts and top ontology, Technical
report, University of Amsterdam, Amsterdam. Deliverables D017, D034, D036,
EuroWordNet, LE2-4003, Final Version.

Vossen, P. et al. (1999), Final report on EuroWordNet 2, Technical report, University
of Amsterdam, Amsterdam. [CD ROM].

Wah Tung Committee, ed. (1983), (a Chinese word dictionary), Wah Tung, Hong 
Kong. [In Chinese]. 

Wong, S. H. S. & Pala, K. (2001), Chinese Radicals and Top Ontology in WordNet, 
in ‘Text, Speech and Dialogue—Proceedings of the Fourth International Workshop, 
TSD 2001, Pilsen, 10-13 September 2001’, Lecture Notes in Artificial Intelligence, 
Subseries of Lecture Notes in Computer Sciences, Faculty of Applied Sciences, 
University of West Bohemia, Springer, Berlin. 


Contacts 

Authors: Shun Ha Sylvia WONG 

Addresses: Computer Science 
Aston University 
Aston Triangle 
Birmingham B4 7ET 
United Kingdom 

E-mail: s.h.s.wong@aston.ac.uk 


Karel PALA 

Faculty of Informatics 

Masaryk University 

Botanická 68a

Brno 60200 

Czech Republic 

pala@informatics.muni.cz 





Evaluation of WordNet-Like Lexicon 

Jiangsheng Yu 


Abstract 

The various specific applications of the WordNet-like lexicon in NLP, undoubtedly
from the viewpoint of Computational Lexicology, require the diversification of
its semantic representations. That is, the structure of the concept net must sometimes be
changed according to a precise purpose. Since one can delete a node or subtree from
the original lexicon as the WordNet-like lexicon evolves, some analysis of its
knowledge structure seems necessary. The author defines the degree of
structural destruction for the evolution of a WordNet-like lexicon as an explicit standard
to restrict the variation of knowledge structure, which is related to the well-defined
deductive rules in the lexicon. Lastly, the visualized auxiliary construction of the Chinese
Concept Dictionary is introduced as an approach to a WordNet-like lexicon.

Keywords: concept, WordNet, destruction degree, evolution


Introduction 

Semantic knowledge representation has seemed more and more imperative in Computational
Linguistics over the last decade. Now, there are various semantic lexicons (or simply
lexicons, where no misunderstanding arises) with distinct structures and purposes, such as WordNet,
MindNet, FrameNet, HowNet, CCD 1 , etc. The structure of a lexicon depends on the
definition of meaning 2 . In general, given a finite set of words, $\Sigma$, in a natural language
$\mathcal{L}$, what we can do in a lexicon is just to list all the possible meanings represented by
the elements in $\mathcal{L}$, with some well-defined structures which describe the semantic
knowledge. Thus, for both linguistics and computational linguistics, it’s necessary to clarify
the definitions of concept (or meaning), sense and the relationship between them. The
answer of WordNet 3 at Princeton University is mathematically formalized in Figure 1.

WordNet is structured as a semantic network by the hypernymy relation, which describes
an inheritance hierarchy, and by other subordinate relations such as the oppositive
relation, backward presupposition, etc. The deductive rules on WordNet are quite
simple and can easily be applied in the semantic analysis of natural language
processing (NLP). The existence of EuroWordNet (see [19] and [7]) illustrates the wide
acceptance of the WordNet-like framework for lexical semantics. As a static lexicon, WordNet
just represents a single ontology that the psycholinguists are interested in. Actually,
the structure of the so-called common knowledge is nothing but a statistical
distribution, which is affected by cultures and personal experiences. Oriented to a specific
application, such as IE, the appropriate information in a WordNet-like lexicon seems
necessary. For example, $C$ = {earthquake, quake, temblor, seism} is not only a kind of
$C'$ = {geological phenomenon}, but also a kind of $C''$ = {natural disaster}. Suppose
that there are $n$ angles of view, $v_1, v_2, \ldots, v_n$; then the hypernymy relation could be
classified into $\{h_1, h_2, \ldots, h_n\}$, where $h_i$ is defined by $v_i$. A geographer prefers
$C' \prec_{v_i} C$, while the common people think $C'' \prec_{v_j} C$ more reasonable. Imagining
that $v_1, v_2, \ldots, v_n$ are painted in $n$ distinct colors, the construction of the hypernymy tree
is to endue each two SynSets in a category with a colored arc if necessary. From $v_i$ to $v_j$, we
have to operate on the lexicon and examine the evolution of the semantic knowledge representation.

1 Chinese Concept Dictionary, a Chinese WordNet-like lexicon, developed by the Institute of
Computational Linguistics (ICL), Peking University. More details in [20], [22], [25], [26] and [27]. But neither
the improvement of CCD in sentence frame (i.e., the longest chunk sequence with closed semantic
constraints from noun concepts) nor the semantic relationship between brother nodes in a hypernymy tree
is the topic of this paper.

2 In Die Grundlagen der Arithmetik (1884), Frege argued that the sense of a word should be determined
in a context, not in isolation.

3 More exhaustive descriptions with applications can be found in [6], [2], [5], [11], [12], etc.

Questions                            Answer of WordNet

What is a concept in $\mathcal{L}$?  A concept in $\mathcal{L}$ is just a set of synonyms (SynSet),
                                     which is determined by the Principle of Substitution.

$\forall w \in \Sigma$, what is      The word sense of $w$ is $\Lambda(w)$, the set of all the
the meaning of $w$?                  SynSets containing $w$.

What is the relationship between     Although there is no mapping from the set of concepts, $\Gamma$,
concept and word sense?              to the set of word senses, $\Lambda$, if
                                     $C \in \Lambda(w_1) \cap \Lambda(w_2)$, then $C \in \Gamma$ and
                                     $\{w_1, w_2\} \subseteq C$.

Figure 1: Axiomatic Viewpoints of WordNet
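
The word-sense function of Figure 1 and the viewpoint-labelled hypernymy arcs can be illustrated with
a short sketch. It assumes a toy lexicon of three SynSets taken from the earthquake example; the
identifiers C, C1, C2, the viewpoint names "geographer" and "common" and the function names are
hypothetical and are not part of WordNet or CCD.

```python
# Toy lexicon (an assumption of this sketch): three SynSets from the example.
SYNSETS = {
    "C": frozenset({"earthquake", "quake", "temblor", "seism"}),
    "C1": frozenset({"geological phenomenon"}),
    "C2": frozenset({"natural disaster"}),
}


def word_senses(word, synsets=SYNSETS):
    """Lambda(w): the set of all SynSets that contain the word w."""
    return {synset for synset in synsets.values() if word in synset}


# One hypernymy relation h_i per viewpoint v_i: each viewpoint "colors"
# its own arc from a concept to its hypernym.
HYPERNYM_BY_VIEW = {
    "geographer": {"C": "C1"},
    "common": {"C": "C2"},
}


def hypernym(concept_id, viewpoint):
    """Return the hypernym of a concept under one viewpoint, or None."""
    return HYPERNYM_BY_VIEW[viewpoint].get(concept_id)


if __name__ == "__main__":
    shared = word_senses("quake") & word_senses("seism")
    # Every SynSet shared by two words contains both of them.
    print(all({"quake", "seism"} <= synset for synset in shared))
    print(hypernym("C", "geographer"), hypernym("C", "common"))
```

Keeping one mapping per viewpoint means that switching from one viewpoint to another never edits the
SynSets themselves, only the arcs that are consulted.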

In section 1, we’ll talk about the formal definition of the lexicon, and the problem of
its evolution will be proposed in the next section. Also, formulae are given to measure
the destruction degree of the knowledge representation from the original lexicon to a new
one, which provides some standards of lexicon construction. Without doubt, from the
viewpoint of Computational Lexicology, the statistical structure analysis makes sense of
the stability of knowledge accumulation. Because WordNet prescribed the initial
hypernymy trees for both noun and verb concepts, the comparison between
two WordNet-like lexicons becomes possible. Lastly, we’ll report the visualized
auxiliary construction of lexicon (VACOL) in CCD, which motivates our consideration of the
evolution problem.


1 Lexicon with Transformations 

Generally, the construction of a lexicon is to induce a structured set Lex from $\mathcal{L}$. In [30],
Wittgenstein (or even earlier, Leibniz in [9]) held that the meaning of a given word
is nothing but its usage in the language. As described in the introduction, each node in the
hypernymy tree is a concept (i.e., SynSet). The scheme of a node (or its neighborhood)
indicates some ontology (maybe from the lexicographers), which will be used as the
semantic knowledge in NLP. Sometimes it’s good, sometimes not enough. Even for a
lexicographer, his (or her) knowledge representation is hard to keep consistent and stable.
The larger a lexicon is, the more difficult it is to keep it consistent.

To overcome the natural disadvantage in the construction of a complicated lexicon, 
besides the guarantees from a large-scale corpus and all kinds of applications, 4 we still 
need a structure estimation for the metalanguage of semantic knowledge representations. 

4 Quine preferred the pragmatistic approach to a concept system in [15].

The minimum requirement is to keep a category as a labelled tree 5 under the hypernymy
relation, in which the hypernym of any concept is defined uniquely. The generator of CCD has
some tools to detect such mistakes 6 by means of well-defined relations between concepts. For
example, the symmetry of the relation between oppositive concepts requires that there should not
be any triangular or polygonal oppositive concepts in CCD. However, a mistake
of knowledge structure, such as placing {tiger, Panthera tigris} and {big cat, cat} as
brother-nodes (i.e., both of them are regarded as hyponyms of {feline, felid}), cannot be
detected automatically.

Definition 1.1 A lexicon is a triple $Lex = (S, R, T)$ satisfying that $\forall t \in T$,
$S \xrightarrow{t} S'$, where $S \subseteq 2^{\Sigma}$ is a structured set, $R$ the set of deductive
rules on $S$ and $T$ the set of all structural transformations of $S$. Each transformation keeps
the type 7 from $S$ to $S'$. In fact, $T$ is a noncommutative group with composition.

$T$ is closed, which means that the composition of elements in $T$ does not change the type
of $S$. By this assumption, the deductive rules on $S$ are not affected by the evolution of the
lexicon. That is, based on distinct ontologies and purposes, the specific representations
in the lexicon could be different, while the set of deductive rules is determined uniquely by the
structural type of $S$. The least set of structural transformations on $S$ includes
operations in the labelled tree and between the labelled trees, described by:

1. INSERT a brother-node; 

2. COLLAPSE a non-root node to its father-node, and delete the single node for 
degenerative tree; 

3. ROOT: add a new root for the original tree;

4. ADD a relation between labelled trees; 

5. DELETE a relation between labelled trees. 
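
The three operations inside a labelled tree can be sketched as follows. This is a minimal sketch under
stated assumptions, not the CCD implementation: the class Node and the function names are hypothetical,
and COLLAPSE is read here as merging a node into its father-node with its children moved up one level,
which the text does not spell out.

```python
class Node:
    """A node of a labelled tree; `label` stands for a concept (SynSet)."""

    def __init__(self, label, children=None):
        self.label = label
        self.parent = None
        self.children = list(children or [])
        for child in self.children:
            child.parent = self


def insert_brother(node, label):
    """INSERT: add a new leaf as a brother-node of `node`."""
    if node.parent is None:
        raise ValueError("the root has no brother-node")
    brother = Node(label)
    brother.parent = node.parent
    node.parent.children.append(brother)
    return brother


def collapse(node):
    """COLLAPSE: merge a non-root node into its father-node (assumed reading:
    the children of the collapsed node become children of the father)."""
    father = node.parent
    if father is None:
        raise ValueError("only a non-root node can be collapsed")
    position = father.children.index(node)
    for child in node.children:
        child.parent = father
    father.children[position:position + 1] = node.children
    node.parent, node.children = None, []
    return father


def add_root(root, label):
    """ROOT: add a new root above the original tree."""
    return Node(label, [root])


def shape(node):
    """Nested (label, children) view of a tree, convenient for comparison."""
    return (node.label, [shape(child) for child in node.children])


if __name__ == "__main__":
    here, there = Node("here"), Node("there")
    location = Node("location", [here, there])
    before = shape(location)
    base = insert_brother(here, "base")
    collapse(base)
    print(shape(location) == before)
```

Inserting a leaf and then collapsing it restores the original tree, which is the sense in which INSERT
and COLLAPSE undo each other in this sketch.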

Obviously, operations in a tree just change the hypernymy relation, and operations
between trees are used for the opposite relation, holonymy relation, causal relation, etc. Some
common operations in a tree, such as deleting, cutting, copying or pasting a subtree, can be generated
by the five operations mentioned above. More details about labelled trees can be found
in [13] and [14], and what we emphasize is that the five operations are not closed for
lattices in general.

For both noun and verb concepts, hypernymy ordering as the main structure embodies
the usual way of human concept classification. Although the classification
may not be a partition, the degree of a weak partition 8 will provide a direction for
the lexicon evolution.

Example 1.1 INSERT and COLLAPSE are inverse mutually: 

5 In WordNet 1.6, there are more than 700 noun concepts with multiple hypernyms in a category.

6 See the cycle error for the hypernyms of verb concept {reserve, hold, book} in WordNet 1.6.

7 For instance, labelled tree or complete lattice.

8 Set $P \subseteq 2^S$ is called a partition of set $S$ if and only if $\bigcup P = S$ and $\forall a, b \in P$, if $a \neq b$, then $a \cap b = \emptyset$.
A partition determines an equivalence relation uniquely and vice versa. Set $P \subseteq 2^S$ is called a weak
partition of set $S$ if and only if $\bigcup P = S$ and $\forall a, b \in P$, if $a \neq b$, then ml Jffj | 6 | ) < 1. Obviously, for any
set $S$, its partition is also a weak partition.



Figure 2: INSERT and COLLAPSE operations in a labelled tree 

2 Destruction of Knowledge Structure 

Definition 2.1 Let $\Omega(S)$ denote the least number of operations constructing $S$, and
$\Omega(S \to S')$ the least number of operations from $S$ to $S'$. It’s easy to verify that

Theorem 2.1 $\Omega(\cdot)$ is a distance, i.e., it satisfies that $\forall S, S', S''$,

1. $\Omega(S \to S') \geq 0$
2. $\Omega(S \to S') = 0 \Leftrightarrow S = S'$
3. $\Omega(S \to S') = \Omega(S' \to S)$
4. $\Omega(S \to S'') \leq \Omega(S \to S') + \Omega(S' \to S'')$

Corollary 2.1 $\Omega(S') \leq \Omega(S) + \Omega(S \to S')$

Proof Recall that $\Omega(S') = \Omega(\emptyset \to S')$ and $\Omega(S) = \Omega(\emptyset \to S)$.

Definition 2.2 The degree of structural destruction from $S$ to $S'$ is defined by

$$\rho(S \to S') = 1 - \frac{\Omega(S')}{\Omega(S) + \Omega(S \to S')} \qquad (1)$$

Property 2.1 $0 \leq \rho(S \to S') \leq 1$

For the two extreme cases of $\rho(S \to S')$: if $\rho(S \to S') = 0$, i.e. $\Omega(S') = \Omega(S) + \Omega(S \to S')$,
then the operations from $S$ to $S'$ only include ADD, INSERT and ROOT, which do not
destroy the knowledge structure of $S$; if $\rho(S \to S') = 1$, i.e. $\Omega(S') = 0$, then the operations
from $S$ to $S'$ are COLLAPSE and DELETE, which destroy the knowledge structure of
$S$ completely. In other words, formula (1) provides a standard for lexicon evolution: the
closer $\rho(S \to S')$ is to 1, the less stable is the evolution of the lexicon. Without
doubt, the complete destruction of a lexicon is not to be encouraged, because it totally
negates the original knowledge representation.
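Formula (1) is simple enough to check numerically. The sketch below assumes the three operation counts are already known (computing the least number of operations is not attempted here); the function name, the parameter names and the convention of returning 0 when nothing was built are assumptions of this example.

```python
def destruction_degree(omega_s: int, omega_s_prime: int, omega_step: int) -> float:
    """Degree of structural destruction, following formula (1).

    omega_s       -- least number of operations constructing S
    omega_s_prime -- least number of operations constructing S'
    omega_step    -- least number of operations leading from S to S'
    """
    denominator = omega_s + omega_step
    if denominator == 0:
        # Assumed convention: S and S' are both empty, so nothing is destroyed.
        return 0.0
    if omega_s_prime > denominator:
        # Corollary 2.1 would be violated, so the inputs cannot be least counts.
        raise ValueError("expected omega_s_prime <= omega_s + omega_step")
    return 1.0 - omega_s_prime / denominator


# Purely constructive evolution: 3 additions on top of a 10-operation lexicon.
no_destruction = destruction_degree(10, 13, 3)
# Everything is torn down again: S' is empty.
total_destruction = destruction_degree(10, 0, 10)
```

The two calls reproduce the two extreme cases discussed above; any mixture of constructive and destructive steps falls strictly between them.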






Theorem 2.2 Let $S_0, S_1, S_2, \ldots, S_n$ be an observed sequence of evolutions from the
normal distribution $N(\rho, \delta)$ and $\rho_i = \rho(S_i \to S_{i+1})$, where $n$ is large enough. The maximum
likelihood estimates of the destruction degree $\rho$ and the variance $\delta$ are

$$\hat{\rho} = \frac{\sum_{i=0}^{n} \rho_i}{n+1}, \qquad \hat{\delta} = \frac{\sum_{i=0}^{n} (\rho_i - \hat{\rho})^2}{n+1} \qquad (2)$$

What’s more, $\hat{\rho}$ and $\hat{\delta}$ are unbiased minimum variance estimators 9 of the parameters $\rho$ and $\delta$
respectively. These two parameters describe the stability of construction collectively.

Theorem 2.3 Obviously, $\rho(S_0 \to S_n) \leq \sum_{i=0}^{n} \rho_i < n+1$; thus sometimes $\hat{\rho} = \frac{1}{n+1} \sum_{i=0}^{n} \rho_i < 1$
is called the most conservative estimator of destruction for the lexicon evolution.


Without loss of generality, if $\hat{\rho}$ is beyond the threshold value of evolution, then there
must be some steps that destroy the knowledge structure too much. If the change leads
to a much better or much worse application result, the structure is sensitive
for this application; if not, the structure is robust. More interestingly, the destruction
degree provides a method of comparison between WordNet-like lexicons built from distinct
points of view (or for distinct applications).
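As a rough illustration of how the two stability parameters could be estimated from a recorded history, the sketch below computes the sample mean and the maximum-likelihood variance (dividing by the number of observed steps) of a list of per-step destruction degrees. The function names and the threshold value are hypothetical; the paper does not state a concrete threshold.

```python
from typing import Sequence, Tuple


def estimate_destruction(rhos: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, ML variance) of the observed per-step destruction degrees."""
    count = len(rhos)
    if count == 0:
        raise ValueError("at least one evolution step is required")
    mean = sum(rhos) / count
    variance = sum((r - mean) ** 2 for r in rhos) / count
    return mean, variance


def exceeds_threshold(rhos: Sequence[float], threshold: float) -> bool:
    """True if the estimated destruction degree is beyond `threshold` (assumed value)."""
    mean, _ = estimate_destruction(rhos)
    return mean > threshold


history = [0.1, 0.2, 0.3]          # toy values, not measured data
mean, variance = estimate_destruction(history)
```

A large mean signals steps that destroy too much structure, while a large variance signals an uneven evolution even when the mean is acceptable.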


Definition 2.3 For any $w$ in a given text, the process of Word Sense Disambiguation
(WSD) is to choose a suitable element in $A(w)$ as its concept tag. Once the sense of $w$
is identified as concept $C$, the machine appears to understand the meaning of $w$ through the
reasonable substitution of its synonyms.


The traditional semantic tags come from some ontology, whose a priori nature is often
criticized by computational linguists. For us, the empirical method must permeate
each step of WSD because of the complexity of language knowledge. A statistical approach
to WSD needs a well concept-tagged corpus as the training set. Suppose that there is
a large-scale concept-tagged corpus. To avoid the sparse data problem, the training
of a concept TagSet, $T_S$, from the tagged corpus is to collapse $T_S$ along the hypernymy
tree. 10 Let $T_S$ also denote the set of all the nodes in $T_S$; obviously, $|T_S \oplus T_{S'}| \leq \Omega(T_S \to T_{S'})$.

Definition 2.4 If $S \to S'$, then the difference between $T_S$ and $T_{S'}$ described by

$$\tau(S \to S') = \frac{|T_S \oplus T_{S'}|}{|T_S \cup T_{S'}|} \qquad (3)$$

is called the first TagSet-index of the evolution. And $\rho(T_S \to T_{S'})$ is called the second
TagSet-index of the evolution. If $\rho(S \to S') \neq 0$, then [...] is called the third
TagSet-index of the evolution. The evolution is called TagSet-stable if the three indexes
are each small enough. In general, the diagram in Figure 3 is not commutative.
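The first TagSet-index is the size of the symmetric difference of the two tag sets divided by the size of their union, which maps directly onto set operations. The sketch below is only an illustration: the function name, the toy tags and the convention of returning 0 for two empty sets are assumptions.

```python
from typing import Set


def first_tagset_index(tags_before: Set[str], tags_after: Set[str]) -> float:
    """Symmetric difference over union of two concept tag sets."""
    union = tags_before | tags_after
    if not union:
        # Assumed convention: two empty TagSets do not differ.
        return 0.0
    return len(tags_before ^ tags_after) / len(union)


old_tags = {"location", "point", "region"}   # toy tags, hypothetical
new_tags = {"point", "region", "base"}
index = first_tagset_index(old_tags, new_tags)
```

The value is 0 when the trained TagSet is unchanged by the evolution and 1 when the two TagSets share no node at all.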


Property 2.2 If the evolution $S \to S'$ is TagSet-stable, then the knowledge destruction
from $S$ to $S'$ does not affect the training of the concept TagSet too much.


By the three indexes, we are able to confine any evolution to a reasonable one, from which
the TagSet is trained stably. It is now clear that an appropriate evolution of the
WordNet lexicon is restricted by the degree of structural destruction and the stability of
TagSet training from a large-scale tagged corpus.

9 Because $n$ is large enough to make $\frac{1}{n+1} \sum_{i=0}^{n} (\rho_i - \hat{\rho})^2 \approx \frac{1}{n} \sum_{i=0}^{n} (\rho_i - \hat{\rho})^2$.

10 In fact, $T_S$ is a subgraph of the WordNet-like lexicon Lex = (S, R, T).


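[Diagram: $T_S \to T_{S'}$ labelled $\rho(T_S \to T_{S'})$; $T_S \to S$ labelled $\rho(T_S \to S) = 0$; $S \to S'$ labelled $\rho(S \to S')$; $T_{S'} \to S'$ labelled $\rho(T_{S'} \to S') = 0$]
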

Figure 3: Explanation of the third TagSet-index 

3 Visualized Auxiliary Construction of Lexicon 

To verify the consistency of brother-nodes in a hypernymy tree, CCD provides a visualized
tool that helps lexicographers inspect the distribution of concepts at the same level (see
Figure 2). The database responds to operations in the foreground interface. The lexicographers
need not know how the data are actually handled; their only task is to express their
semantic knowledge. As operations are performed, the history records are updated.

We’ve tried two ways to get CCD: one is to construct it independently from the same
initial classification as WordNet, and the other depends on the translation of WordNet and
its transformation. More than 1,600 concepts were constructed by the former method,
in which most mistakes lie in the knowledge representation of brother-nodes.
The latter method divides the construction into three steps: (1) translate WordNet into
a bilingual concept lexicon with essential improvements; (2) filter CCD from the bilingual
lexicon; (3) transform the unshaped CCD into a well-structured one. So far, we have obtained
more than 18,000 unshaped noun concepts in Chinese from WordNet. The next research topic
is the comparison of knowledge representation between English and Chinese concepts,
especially its evolution.








Figure 4: CCD generator and browser 


As a standard, WordNet provides not only a description method but also an existing
resource, which serves as a pivotal point of coordination for WordNet-like multilingual lexicons.
Since the WordNet browser does not provide a convenient and intuitive inspection of
brother-nodes, 11 a more effective approach to CCD is VACOL, which enables a close look at the
neighborhood of a SynSet in a hypernymy tree, and hence an appropriate presentation of a
concept.

Example 3.1 There is a (strict) partial ordering in the hyponyms of concept {season,
time of year}: {spring, springtime} $\prec$ {summer, summertime} $\prec$ {fall, autumn} $\prec$
{winter, wintertime}. That is, the brother-nodes are also ordered by time, position or
other standards.

The adjustment of nodes in a hypernymy tree by the VACOL software does not change the
other relationships between concepts, thanks to the offsets in the WordNet
database. To keep the communication 12 in VACOL, all the lexicographers modify a
single database on the server through a local area network. To reflect common knowledge
of concepts, CCD gathers only the most frequent Chinese words from the large-scale
segmented and POS-tagged corpus 13 in our institute. According to our plan, more than 60,000 noun
concepts will be finished in two years, followed by the verb concepts, with the close semantic
constraints imposed by the noun concepts in their sentence frames.

4 Conclusion 

For WordNet-like lexicons, as for any other well-structured lexicon, the
evolution rules are always part of the metarules of the construction, especially for
complicated structures. The construction is hardly complete if no method can be used to distinguish
$Lex_i$ from $Lex_j$, where $Lex_i$ is derived from $Lex_j$. Further research includes statistical
evolution, automatic knowledge mining from an encyclopedia and a large-scale
corpus, and feedback from the results of specific applications. We are now
studying the automatic training of concept TagSets along the hypernymy tree and
the Hidden Markov Model with two parameters (i.e., POS and concept), which are the
main parts of Word Sense Disambiguation (more details in [28]).


Acknowledgement 

Many thanks to Prof. YU Shiwen for his trust in my understanding of the Chinese Concept
Dictionary (CCD). I also appreciate all the members participating in the CCD project,
especially Mr. ZHANG Huarui, Ms. SONG Chunyan, Mr. LIU Yang, Ms. BAI Xiaojing,
and our linguistics advisors, Prof. LU Chuan and Dr. ZHAN Weidong. The
cheerful collaboration with Mr. LIU Dong and Ms. WANG Wei from Pecan Company
is memorable for all of us. I thank all my friends at the Second Workshop on Chinese
Lexical Semantics for their kind private discussions with the author. Lastly, my deepest
thanks go to my wife for her long tolerance of my weaselling out of the
housework under the false pretense of research.

11 To look up the brother-nodes of C in WordNet, one first has to find the hypernym of C (say C'),
and then check all the hyponyms of C' in the browser.

12 For instance, the relationship between the concepts from distinct categories.

13 One year of the newspaper People’s Daily, the most popular and regular official media in China.
So far, we have finished ten months of it, containing at least 25,000,000 Chinese characters.





References 


[1] Aristotle. (1941). Categoriae. in The Basic Works of Aristotle, R. McKeon (ed). 
Random House, New York 

[2] Beckwith, R. (1993). Design and Implementation of the WordNet Lexical Database 
and Searching Software , in the attached specification of WordNet 1.6. 

[3] Carnap, R. (1966). Der Logische Aufbau Der Welt , Felix Meiner Verlag, Hamburg 

[4] Carpenter, B. (1992). The Logic of Typed Feature Structures , Cambridge University 
Press 

[5] Fellbaum, C. (1993). English Verbs as a Semantic Net , in the attached specification 
of WordNet 1.6. 

[6] Fellbaum, C. (ed) (1999). WordNet: An Electronic Lexical Database , The MIT Press. 

[7] Huang, C.R. et al (2001). Linguistic Tests for Chinese Lexical Semantic Relations:
Methodology and Implications , report in the Second Workshop on Chinese Lexical
Semantics.

[8] Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal
about the Mind. University of Chicago Press. 

[9] Leibniz, G.W. (1981). New Essays on Human Understanding , P. Remnant and J. 
Bennett (Ed. and trans.), Cambridge University Press. 

[10] Lucas, W.F. (ed) (1983). Modules in Applied Mathematics, Springer-Verlag New York,
Inc.

[11] Miller, G.A. et al (1993). Introduction to WordNet: An On-line Lexical Database , in 
the attached specification of WordNet 1.6. 

[12] Miller, G.A. (1993). Nouns in WordNet: A Lexical Inheritance System , in the
attached specification of WordNet 1.6.

[13] Partee, B.H. et al (1990). Mathematical Methods in Linguistics , Kluwer Academic 
Publishers. 

[14] Priss, U. (1999). The Formalization of WordNet by Methods of Relational Concept 
Analysis, in WordNet: An Electronic Lexical Database, Fellbaum C. (ed), The MIT 
Press. 179-196 

[15] Quine, W. (1980). From a Logical Point of View , Harvard University Press 

[16] Rosch, E. (1975). Cognitive Representations of Semantic Categories , in Journal of 
Experimental Psychology, 104, 192-233. 

[17] Russell, B. (1948). Human Knowledge — Its Scope and Limits, Simon and Schuster 

[18] Russell, B. (1989). Logic and Knowledge, Unwin Hyman Ltd 

[19] Vossen, P. (ed.), (1998). EuroWordNet: A Multilingual Database with Lexical
Semantic Networks, Dordrecht: Kluwer.





[20] Yu, J.S. (2001). Structures in CCD , report of Second Workshop on Chinese Lexical 
Semantics, held in Peking University 

[21] Yu, J.S. (2001). Algebraic Structures in Linguistics , series of reports in the ICL-salon
of Computational Linguistics, Peking University

[22] Yu, J.S. and Yu, S.W. et al (2001). Introduction to Chinese Concept Dictionary , in 
International Conference on Chinese Computing (ICCC2001), 361-367 

[23] Yu, J.S. (2000). Hidden Markov Model and its Applications in NLP, report of ICL 
Seminar of Natural Language Processing, Peking University. 

[24] Yu, J.S. (2000). Machine Segmentation Ambiguities and Dynamic Lexicon , in
Associated Conference AI2000

[25] Yu, J.S. (2001). Construction of Semantic Lexicon, draft in the author’s homepage 
http://icl.pku.edu.cn/yujs 

[26] Yu, J.S. (2001). Specification of Chinese Concept Dictionary , draft of ICL, Peking 
University, 2000 

[27] Yu, J.S. and Yu, S.W. (2001). The Structure of Chinese Concept Dictionary , accepted 
by Journal of Chinese Information Processing, 2001 

[28] Yu, J.S. and Yu, S.W. (2001). WSD based on Integrated Language Knowledge Base , 
accepted by The 2nd International Conference on East-Asian Language Processing 
and Internet Information Technology (EALPIIT’2002). 

[29] Yu, S.W. (2000). The Comprehensive Chinese Language Knowledge Base and its
Applications in the Teaching of Chinese Language , in the 4th Global Chinese Conference
on Computers in Education.

[30] Wittgenstein, L. (1953). Philosophical Investigations , Basil Blackwell Ltd. 


Jiangsheng Yu, Homepage: http://icl.pku.edu.cn/yujs, vuis@pku.edu.cn .





EUROTERM 

Extending EWN using both the Expand and Merge Model 

Stamou Sofia 
Ntoulas Alexandras 
Hoppenbrouwers Jeroen 
Saiz - Noeda Maximiliano 
Christodoulakis Dimitris 

Abstract 

EuroTerm aims at expanding EuroWordNet with domain specific terminology for a set of European 
languages. EuroWordNet is a lexical database representing semantic relations among basic concepts for 
West European languages, which are combined with a so-called Inter-Lingual-Index. EuroTerm’s main 
purpose is to combine effectively multilingual domain specific terminology into a common lexical 
database through a Terminology Alignment System, in order to expand EuroWordNet and the Inter- 
Lingual-Index with terms restricted to the conceptual domain of environment. 

1. Introduction 

EuroTerm 1 is an EC funded project (EDC-2214) that aims to extend EuroWordNet (EWN) 
with environmental terminology for the following languages: Greek, Dutch and Spanish. 
EWN is a multilingual database with generic WordNets for eight European languages
(Vossen, 1996). Individual WordNets incorporated in the central database form autonomous
semantic networks linked to each other through an Inter-Lingual-Index (ILI). The aim of
EuroTerm is to enrich EWN with domain specific terminology for the above languages. Each
of the monolingual WordNets will comprise approximately 1,000 synsets and will be stored in a
common database, which will be linked to the central EWN database under the domain label
“Environment”. There are two approaches to building a semantic network, namely the expand
and the merge model approach. The former concerns the translation of English concepts in the 
respective languages and then the actual development of synsets whereas the latter implies the 
independent development of monolingual synsets and then their linking to the most equivalent 
synset in the ILI (Vossen 1996). For the implementation of EuroTerm a combination of the 
two models has been adopted aiming at maximal conceptual coverage across languages and 
maintenance of the language-dependent differences. Thus, unlike the extension of EWN with
computer terminology, where the expand model was followed (Vossen 1999), our approach
differs slightly in that both the merge and the expand model approaches were
followed. Common English lexical resources were used as the starting point for the
terminology extraction, so the expand model approach was adopted first. Once the first
set of terms had been extracted, the merge model was adopted in
order to check these terms against monolingual lexical resources and enrich them with
missing terms. The final set of terms will be incorporated in the database through a
Terminology Alignment System that will enable linking across languages. In the following
section (2) we present our approach to terminology acquisition, and we continue with a
brief description of the Terminology Alignment System (3). Finally, some applications of the
EuroTerm project are discussed (4) and some early conclusions are drawn (5).

1 EuroTerm EDC-2214, “Extending the EuroWordNet with Public Sector Terminology” funded by the EC

2. Our Approach: A Combination of the Expand and the Merge Model

The main objective of EuroTerm is to enrich EWN and the ILI with domain-specific
terminology for the following languages: Greek, Dutch and Spanish. To achieve sufficient
overlap across languages and to overcome vocabulary completeness and coverage issues,
attention should be paid during the selection of terms that will be incorporated in the semantic
network. Thus, after a close investigation of the two different approaches (merge and expand
model) already followed for building semantic networks and having in mind the application
of EuroTerm we concluded that a combination of both approaches would result in more 
consistent and reliable results. The reason for applying both methods is twofold. First and
foremost, we wanted to assure sufficient overlap in the coverage of the monolingual WordNets
while still maintaining language-specific properties; secondly, we wanted to enable
interpretation of the differences found across WordNets once they are incorporated in
Information Retrieval (IR) systems, which is the envisaged application of the project. The
basic idea is that the selection of the candidate terms is performed in two phases. During the
first phase the expand model was applied, according to which common English lexical resources
were used for the terminology extraction. Once the first set of terms had been determined, the
merge model was applied, during which monolingual lexical resources were used for the
terminology acquisition. We started off with a common English environmental corpus of 429
manually collected documents and English glossaries comprising in total 4,972 terms along
with their glosses. We applied a Part-Of-Speech (POS) tagger (Daelemans 1999) and a
lemmatizer to the corpus. Then the TF*IDF metric was applied to weight the frequency of each
lemma, and the lemmas were sorted by the resulting value. Up to this point the extraction
process was conducted automatically. Afterwards the 4,500 2 most frequent terms were semi-automatically 3
checked against English environmental glossaries, and those found in them became candidates
for the environmental WordNet. The candidate terms were again semi-automatically checked
against the ILI to trace which of them were already present with an environmental gloss attached.
Environmental terms missing from the ILI had also to be checked against monolingual 
resources prior to their incorporation in the database. At this stage the merge model was 
applied. In particular, the missing terms were manually checked against monolingual domain 
specific lexica and corpora and their glosses found in explanatory dictionaries were 
investigated in order to conclude on their importance for the underlying languages. In 
addition, environmental terms found in monolingual resources but not in the list of English 
candidates were also checked against monolingual glossaries or thesauri and if they were 
important they were included in the candidates’ list. Importance at this stage was determined 
by the occurrence frequency of a term in the corpus and by its presence in the glossaries. 
Once the selection process was finished, English terms missing from the ILI were manually
translated into the respective languages and received an environmental tag. Moreover, terms
found in monolingual resources were manually translated into English and
incorporated into the ILI. Afterwards the actual development of synsets will follow, based on
the language-internal relations used in the EWN project. Since the project is still running and
the testing phase has not started yet, we cannot report concrete results
about terminology mismatches and alignment across languages. However, it is estimated that
the problems that might appear will mostly concern language particularities and will
not be due to the methodology we followed for term acquisition.
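The automatic part of the extraction (weight each lemma, sort, then keep the top-ranked lemmas that also occur in a glossary) might look like the sketch below. It is not the project's code: the paper does not say which TF*IDF variant was used, so corpus-wide term frequency times log inverse document frequency is an assumption, as are the function names and the toy data.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


def rank_lemmas(documents: List[List[str]], top_k: int) -> List[str]:
    """Rank lemmas by tf * log(N / df) and return the `top_k` best ones."""
    n_docs = len(documents)
    term_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for doc in documents:
        term_freq.update(doc)        # occurrences in the whole corpus
        doc_freq.update(set(doc))    # number of documents containing the lemma
    scores = {
        lemma: term_freq[lemma] * math.log(n_docs / doc_freq[lemma])
        for lemma in term_freq
    }
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda lemma: (-scores[lemma], lemma))
    return ranked[:top_k]


def select_candidates(ranked: Iterable[str], glossary: Set[str]) -> List[str]:
    """Keep only the ranked lemmas that also appear in a domain glossary."""
    return [lemma for lemma in ranked if lemma in glossary]


corpus = [                                   # toy lemmatized documents
    ["emission", "the", "water"],
    ["emission", "the"],
    ["the", "ozone", "ozone"],
]
top = rank_lemmas(corpus, top_k=3)
candidates = select_candidates(top, glossary={"ozone", "emission"})
```

In the project the glossary check was followed by manual verification, which a script of this kind cannot replace.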

3. Terminology Alignment System 

The Terminology Alignment System (TAS) (Hoppenbrouwers, 2001) is a key part of the 
infrastructure underlying the EuroTerm project. Through the TAS, partners can communicate 
and coordinate their work on the individual WordNets in Spanish, Greek, and Dutch. We call
the TAS an alignment system, since it helps terminologists to align their work on the local
WordNets. The TAS is not a unified central database in which local WordNets are merged. 
Instead, it is a simple link database, designed to facilitate cooperative work on local WordNets. 
Although the project itself is based on a common starting fund of 1,000 terms, future
applications will require a much looser connection between partners. Therefore, the concept of
federation, where all partners cooperate but there is as little mandatory standardization as
possible on the local level, is exploited to the maximum. Certain standards must always be
followed, but we assume that the existing practice in EWN provides this necessary minimum.
The plethora of existing WordNet tools, such as Polaris and the various open source systems,
can and will still be used for local WordNet maintenance. The TAS is a federated, networked
tool that links up the local tools and has been built as a database-driven Web.

2 4,500 terms was a sufficient amount of terms to process in order to end up with 1,000 environmental terms,
which was our target.

3 They were firstly checked automatically and then the selection of the candidate terms was verified manually.

4. Applications of the EuroTerm Project 

The most immediate application of the multilingual domain specific WordNet concerns the 
incorporation of the environmental domain into an Information Retrieval (IR) system so that 
documents are semantically and not lexically represented in the index of such systems. Thus, 
query terms will be compared against documents not only by weighting term co-occurrence 
but also by measuring the semantic similarity of query and document sets of indexes. 
To this end, some modifications will need to be made to the search engine(s) in
which the EuroTerm network will be incorporated. In particular, a directory named Environment
will have to be added to the interface of the engine so that users can
specify whether they are interested in the domain of environment or not. In addition, the
engine should keep two separate indexes: in the first one environmental documents will be
clustered, whereas in the second one the remaining documents will be stored. Our objective
is to achieve better precision scores when the search is performed against
the environmental index. In addition, it is expected that mapping query terms to the
correct environmental synsets will significantly improve retrieval results in terms of both
precision and recall, making the obtained results more meaningful to the end user.
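A toy version of the two-index arrangement is sketched below, purely to make the idea concrete. The class, its methods and the sample synset are hypothetical; in particular, the caller decides whether a document is environmental, and the environmental index is keyed by synset identifier so that synonymous query terms reach the same documents.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class TwoIndexEngine:
    """Keeps an environmental index (by synset) and a general index (by term)."""

    def __init__(self, synsets: Dict[str, Set[str]]) -> None:
        # Map every term to the identifier of the synset that contains it.
        self.term_to_synset = {
            term: synset_id
            for synset_id, terms in synsets.items()
            for term in terms
        }
        self.env_index: Dict[str, Set[str]] = defaultdict(set)
        self.general_index: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, tokens: Iterable[str], environmental: bool) -> None:
        """Index a document in exactly one of the two indexes."""
        for token in tokens:
            if environmental:
                synset_id = self.term_to_synset.get(token)
                if synset_id is not None:
                    self.env_index[synset_id].add(doc_id)
            else:
                self.general_index[token].add(doc_id)

    def search(self, query: Iterable[str], environment_only: bool) -> Set[str]:
        """Return the ids of documents matching any query term."""
        hits: Set[str] = set()
        for token in query:
            if environment_only:
                synset_id = self.term_to_synset.get(token)
                if synset_id is not None:
                    hits |= self.env_index.get(synset_id, set())
            else:
                hits |= self.general_index.get(token, set())
        return hits


engine = TwoIndexEngine({"env:1": {"sewage", "wastewater"}})   # toy synset
engine.add("d1", ["wastewater", "treatment"], environmental=True)
engine.add("d2", ["sewage", "opera"], environmental=False)
```

Searching the environmental index for "sewage" finds document d1 even though it only contains "wastewater", which is the kind of semantic rather than lexical match described above.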

5. Conclusions

We have described a combination of the expand and merge model approach to extend EWN 
with environmental terminology through a Terminology Alignment System. The use of 
EuroTerm will be demonstrated in an IR environment with the expectation that it will improve 
recall and precision of the obtained results in a meaningful way. 

Acknowledgements

We wish to thank all partners of the project for their contribution and Dr. Piek Vossen and Prof. 
Christiane Fellbaum for their support. 

References 

Daelemans W., Zavrel J. (1999) “Recent Advances in Memory-Based Part-Of-Speech Tagging”. In
Actas del VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, pp. 590-597,
pub: ILK-9903

Hoppenbrouwers J. (2001) “Requirements of the Terminology Alignment System”, Technical Report
EuroTerm EDC-2214, D.3.2

Vossen P. (1996) Right or Wrong: Combining Lexical Resources in the EuroWordNet Project . In 
Proceedings of the Euralex Conference, pp. 715-728. 

Vossen P., Bloksma L., Peters W., Kunze C., Wagner A., Pala K., Vider K., Bertagna F. (1999)
“Extending the Inter-Lingual-Index with new Concepts”. Deliverable 2D010, EuroWordNet, LE2-4003

S. Stamou {stamou@cti.gr}, A. Ntoulas {ntoulas@cti.gr} and D. Christodoulakis {dxri@cti.gr} can
be reached at the Databases Laboratory of the Computer Engineering & Informatics Department, Patras
University, Greece. J. Hoppenbrouwers {hoppie@kub.nl} can be reached at CentER Applied
Research, Tilburg University, The Netherlands, and M. Saiz-Noeda {max@dlsi.ua.es} can be reached
at the Department of Software & Computing Systems, Alicante University, Spain.





Comparing Ontology-based and Corpus-based Domain 
Annotations in WordNet 


Bernardo Magnini 
Carlo Strapparava 
Giovanni Pezzulo 
Alfio Gliozzo 


Abstract 

Domain information has been regarded as an emerging topic of interest in relation to WordNet. A
lexical resource, WORDNET DOMAINS, is presented, where WORDNET synsets have been annotated
with domain labels such as Medicine, Architecture and Sport. This annotation reflects the
lexico-semantic criteria adopted by humans involved in the annotation. However, from a corpus-based
perspective, domains reflect term distribution in a given text collection. The paper proposes a 
preliminary investigation aiming at comparing and integrating ontology-based and corpus-based 
domain information. 

1. Introduction 

The starting point of this paper is WORDNET DOMAINS (Magnini and Cavaglia, 2000), an extension
of WordNet 1.6 (Fellbaum, 1998) in which synsets have been annotated with one or more domain 
labels. The annotation methodology was mainly manual and based on lexico-semantic criteria which 
take advantage of the already existing conceptual relations in WordNet. However, a question is how 
well this annotation reflects the way synsets occur in a certain text collection. This issue is particularly 
relevant when we want to use the manual annotation for text processing tasks (e.g. word sense 
disambiguation). To make the problem more concrete, let us consider the WordNet synset {heroin, 
diacetyl morphine, H, horse, junk, scag, shit, smack} which is annotated with the Medicine domain 
because "heroin" is a drug (i.e. something that is used as a medicine or narcotic) and drug behavior is 
best described as part of medical knowledge. On the text side, if we consider a news collection (e.g.
the Reuters corpus) the word "heroin" is likely to occur in the context of either crime news or
administrative news, without any strong relation with the medical field. This is a typical difference,
where the manual annotation considers a technical use of a word, while texts record a wider context of
use.

The rationale underlying this paper is that both the sources carry relevant information and that 
supporting ontology-based domain annotations with corpus based domain distributions will probably 
give the best potential for content-based text analysis. As a first step in this direction we propose a 
methodology to automatically acquire domain information for synsets in WordNet from a categorized 
corpus. The Reuters corpus has been used for experiments, mainly because this collection is freely
available for research purposes and because news stories are categorized by means of topic codes, which
makes the comparison with WORDNET DOMAINS easier. We have limited our investigation to a small
set of topic codes, namely those labels in the Reuters which can easily be mapped to labels used for 
the ontology-based annotation of WordNet. Comparing the two annotations for a number of synsets 
has pointed out some relevant phenomena which need to be taken into consideration in the perspective 
of a large-scale automatic acquisition of domain information for WordNet synsets. 

Our interest in domain information is motivated by its utility in many concrete scenarios, including 
word sense disambiguation (WSD) and text categorization (TC). In WSD the underlying hypothesis is 
that information provided by domain labels offers a natural way to establish semantic relations among 
word senses, which can be profitably used during the disambiguation process. In particular, domains 
constitute a fundamental feature of text coherence, such that word senses occurring in a coherent 
portion of text tend to maximize domain similarity. Results recently obtained at the SENSEVAL-2 
initiative with a system based on domain information (Magnini et al., 2001) confirm the working 
hypothesis. 



In TC, categories are symbolic labels, and no additional knowledge, neither of procedural nor of
declarative nature, related to their meaning is available. A resource such as WORDNET DOMAINS
could improve the integration of linguistic knowledge into traditional statistical approaches.

The paper is structured as follows. Section 2 presents WordNet Domains, describing the annotation 
methodology and some problematic aspects. Section 3 describes the procedure to automatically 
acquire domain information from the Reuters collection. Section 4 shows the results of an experiment 
performed on a subset of the corpus. Finally, Section 5 reports some relevant related works. 

2. WordNet domains 

Domains have been used both in Linguistics (i.e. Semantic Fields) and in Lexicography (i.e. Subject
Field Codes) to mark technical usages of words. Although this is useful information for sense
discrimination, in dictionaries it is typically used only for a small portion of the lexicon. WORDNET
DOMAINS is an attempt to extend the coverage of domain labels within an already existing lexical
database, WORDNET (version 1.6). Synsets have been annotated with at least one domain label,
selected from a set of about two hundred labels hierarchically organized (see (Magnini and Cavaglia,
2000) for details about the domain taxonomy).

Information brought by domains is complementary to what is already in WORDNET. First of all, a
domain may include synsets of different syntactic categories: for instance MEDICINE groups together
senses from Nouns, such as doctor#1 and hospital#1, and from Verbs such as operate#7. Second, a
domain may include senses from different WORDNET sub-hierarchies (i.e. deriving from different
"unique beginners" or from different "lexicographer files"). For example, SPORT contains senses such
as athlete#1, deriving from life_form#1, game_equipment#1, from physical_object#1, sport#1, from
act#2, and playing_field#1, from location#1.

Finally, domains may group senses of the same word into homogeneous clusters, with the side effect
of reducing word polysemy in WORDNET. Table 1 shows an example. The word bank has ten
different senses in WORDNET 1.6: three of them (i.e. bank#1, bank#3 and bank#6) can be grouped
under the ECONOMY domain, while bank#2 and bank#7 both belong to GEOGRAPHY and GEOLOGY,
causing the reduction of the polysemy from 10 to 7 senses.
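The reduction from 10 to 7 can be reproduced by clustering the senses of bank that carry exactly the same set of domain labels. The sketch below uses the labels listed in Table 1; the function name and the decision to cluster on identical label sets are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

# Domain labels of the ten WordNet 1.6 senses of "bank", as given in Table 1.
BANK_DOMAINS: Dict[int, Set[str]] = {
    1: {"ECONOMY"},
    2: {"GEOGRAPHY", "GEOLOGY"},
    3: {"ECONOMY"},
    4: {"ARCHITECTURE", "ECONOMY"},
    5: {"FACTOTUM"},
    6: {"ECONOMY"},
    7: {"GEOGRAPHY", "GEOLOGY"},
    8: {"ECONOMY", "PLAY"},
    9: {"ARCHITECTURE"},
    10: {"TRANSPORT"},
}


def cluster_by_domains(sense_domains: Dict[int, Set[str]]) -> List[List[int]]:
    """Group sense numbers that share exactly the same set of domain labels."""
    clusters: Dict[FrozenSet[str], List[int]] = defaultdict(list)
    for sense in sorted(sense_domains):
        clusters[frozenset(sense_domains[sense])].append(sense)
    return list(clusters.values())


bank_clusters = cluster_by_domains(BANK_DOMAINS)
```

Senses 1, 3 and 6 fall into one cluster and senses 2 and 7 into another, leaving seven clusters in total.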

For the purposes of the experiment reported in this paper we have considered a set of 41 disjoint
labels, which allows a good level of abstraction without losing relevant information (e.g. in the
experiments we have used SPORT in place of VOLLEY or BASKETBALL, which are subsumed by
SPORT).

The procedure for annotating synsets with domain labels is based on lexico-semantic criteria which
exploit the WORDNET taxonomy. First, a small number of high level synsets are manually annotated
with their pertinent domain. Then, an automatic procedure exploits some of the WORDNET relations
(i.e. hyponymy, troponymy, meronymy, antonymy and pertain-to) to extend the manual assignments
to all the reachable synsets. As an example, this inheritance-based procedure allows us to automatically
mark the synset {beak, bill, neb, nib} with the code ZOOLOGY, starting from the synset {bird} and
following a "part-of" relation.


Sense  Synset and Gloss                                            Domains
#1     depository financial institution, bank, banking concern,   Economy
       banking company (a financial institution...)
#2     bank (sloping land...)                                      Geography, Geology
#3     bank (a supply or stock held in reserve...)                 Economy
#4     bank, bank building (a building...)                         Architecture, Economy
#5     bank (an arrangement of similar objects...)                 Factotum
#6     savings bank, coin bank, money box, bank (a container...)   Economy
#7     bank (a long ridge or pile...)                              Geography, Geology
#8     bank (the funds held by a gambling house...)                Economy, Play
#9     bank, cant, camber (a slope in the turn of a road...)       Architecture
#10    bank (a flight maneuver...)                                 Transport

Table 1: WordNet senses and domains for the word "bank"
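
The polysemy reduction illustrated by Table 1 can be reproduced with a short sketch. It assumes, as a simplification not stated in the paper, that two senses fall into the same cluster exactly when their sets of domain labels are identical; the function name cluster_by_domains is hypothetical.

```python
from collections import defaultdict

# Domain labels for the ten senses of "bank", as listed in Table 1.
BANK_DOMAINS = {
    1: {"Economy"},
    2: {"Geography", "Geology"},
    3: {"Economy"},
    4: {"Architecture", "Economy"},
    5: {"Factotum"},
    6: {"Economy"},
    7: {"Geography", "Geology"},
    8: {"Economy", "Play"},
    9: {"Architecture"},
    10: {"Transport"},
}


def cluster_by_domains(sense_domains):
    """Group sense numbers whose domain label sets are identical.

    Assumption of this sketch: identical label sets define a cluster.
    """
    clusters = defaultdict(list)
    for sense, domains in sorted(sense_domains.items()):
        clusters[frozenset(domains)].append(sense)
    return dict(clusters)


clusters = cluster_by_domains(BANK_DOMAINS)
print(len(BANK_DOMAINS), "senses ->", len(clusters), "clusters")
```

Under this assumption the ten senses collapse into seven clusters, with senses 1, 3 and 6 sharing ECONOMY and senses 2 and 7 sharing GEOGRAPHY and GEOLOGY, which agrees with the reduction reported in the text.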


There are cases in which the inheritance procedure has to be blocked, inserting an "exception", to
prevent a wrong propagation. For instance, barber_chair#1, being a "part-of" barbershop#1, which in
turn is annotated with COMMERCE, would wrongly inherit the same domain. To deal with these cases,
the inheritance procedure allows the declaration of exceptions, such as:

assign shop#1 to Commerce
with exception [part, isa, shop#1]

which assigns the synset shop#1 to COMMERCE, but excludes all the parts of the daughters of shop#1
(for example, the parts of barbershop#1).
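
A minimal sketch of inheritance with exceptions is given below. It is not the authors' implementation: the graph encoding, the function propagate and the synset bookshop#1 are hypothetical, and the exception [part, isa, shop#1] is read here as "block the relation path isa followed by part, starting from shop#1".

```python
from collections import deque

# Toy graph. Each entry reads: parent -> [(relation, child), ...].
# Only shop#1, barbershop#1 and barber_chair#1 come from the text;
# bookshop#1 is an invented sibling used for illustration.
EDGES = {
    "shop#1": [("isa", "barbershop#1"), ("isa", "bookshop#1")],
    "barbershop#1": [("part", "barber_chair#1")],
    "bookshop#1": [],
    "barber_chair#1": [],
}


def propagate(root, domain, edges, blocked_paths=()):
    """Assign `domain` to `root` and to every synset reachable from it,
    skipping any synset whose relation path from the root is blocked."""
    blocked = {tuple(path) for path in blocked_paths}
    labels = {root: domain}
    queue = deque([(root, ())])
    while queue:
        node, path = queue.popleft()
        for relation, child in edges.get(node, []):
            child_path = path + (relation,)
            if child_path in blocked or child in labels:
                continue  # exception hit, or synset already labelled
            labels[child] = domain
            queue.append((child, child_path))
    return labels


# "with exception [part, isa, shop#1]": a part of a daughter of shop#1.
labels = propagate("shop#1", "Commerce", EDGES, blocked_paths=[("isa", "part")])
print(labels)
```

In this toy run barbershop#1 inherits COMMERCE while barber_chair#1 stays unlabelled. The sketch matches exception paths exactly, so it only covers direct daughters; a fuller version would presumably also handle deeper hyponyms.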

2.1. Factotum 

There are a number of WordNet synsets that do not belong to a specific domain, but rather can
appear in almost all of them. For this reason, a FACTOTUM label has been created which basically
includes two types of synsets:

• Generic synsets, which are hard to classify in a particular domain, such as: 
man#l an adult male person (as opposed to a woman) 
man#3 the generic use of the word to refer to any human being 
date#l day of the month 
date#3 appointment, engagement 

They are generally placed high in the WordNet hierarchy and are related senses of highly 
polysemous words. Many verb synsets fall in this category. 

• Stop Senses synsets, which appear frequently in different contexts, such as numbers, week days, 
colors, etc. These synsets usually include non polysemous words and behave much as stop words, 
because they do not significantly contribute to the overall sense of a text. 

2.2. Specialist vs. Generic Usages

Domain labels (about 250) in WORDNET DOMAINS have been selected from dictionaries and then
structured in a taxonomy which follows the Dewey Decimal Classification (DDC (Diekema, 1998)).
The annotation task consists of interpreting a WordNet synset with respect to the DDC
classification. A relevant problem arises for those synsets that occur in a well defined (i.e.
specialist) context in the WordNet hierarchy, but have a wider (i.e. generic) textual usage.

An example of this situation is the synset {feeling - (the psychological feature of experiencing
affective and emotional states...)}, which, given its definition, could be annotated under the
PSYCHOLOGY domain. However, at least intuitively, the use of this synset in documents is broader
than the psychological discipline, and a FACTOTUM annotation would be more coherent with the
constant distribution of this term in all the domains of the Reuters corpus.

What we expect from the corpus-based acquisition procedure described in the next Section is domain
information which we can use either to modify or to integrate the domain annotation based on DDC.

3. CORPUS-BASED ACQUISITION 

This section reports the methodology used to automatically acquire domain information from the 
Reuters corpus and to compare it with domain annotations already present in WORDNET DOMAINS. 


The following steps have been carried out: (i) linguistic processing of the corpus, which includes POS
tagging, multiword identification and filtering on WORDNET; (ii) acquisition of domain information
for WORDNET synsets, based on probability distribution in the corpus; (iii) matching of the acquired
information with the manual domain annotation.

3.1. EXPERIMENTAL SETTING 

The Reuters Corpus (Reuters, 1997) is a collection of about 390,000 English news stories, freely available for
research purposes. Each story is annotated with one or more Topic Codes, selected from a set of 127
labels, covering a variety of topics, even if economy and politics are prevalent. The mapping between
Reuters Topic Codes and WORDNET DOMAINS domain labels is not trivial: the domains are different both
in their extension and in their structure. For the acquisition experiment described in this paper, a
limited subset of the Reuters Topic Codes has been considered, which can be easily mapped to the
domain labels used in WORDNET DOMAINS.


Domain     Topic Code   # Reuters tokens
Religion   GREL         307219
Art        GENT         400637
Military   GVIO         3798848
Law        GCRIM        2864378
Sport      GSPO         2230613

Table 2: Domain Set for the acquisition experiment.

The selected Topic Codes (i.e. the domain set) are GREL (Religion), GENT (Arts, Culture,
Entertainment), GVIO (War, Civil War), GCRIM (Crime, Law Enforcement) and GSPO (Sports), for a total of
about 18 million tokens in the Reuters corpus.

3.2. LINGUISTIC PROCESSING 

The subset of the Reuters corpus was first lemmatized and annotated with part-of-speech tags. The
TreeTagger developed at the University of Stuttgart (Schmid, 1994) has been used in this phase, as
well as the WORDNET morphological analyzer, to resolve ambiguities and lemmatization mistakes.
Then a filter was applied to identify the words actually contained in WORDNET 1.6, including
multiwords. This process resulted in the identification of 36,503 lemmas, which include 6,137
multiwords.

Table 2 shows the mapping between Reuters Topic Codes and WORDNET domains, as well as the
number of tokens in the Reuters Corpus which are in WordNet 1.6 for each domain.

3.3. ACQUISITION PROCEDURE 

Given a synset in WORDNET DOMAINS, the acquisition procedure aims at identifying which domain,
among those selected for the experiment, is relevant in the Reuters corpus for that synset.

As a first step a Relevant Lemma List for a synset is built as the union of the synonyms and of the
content words of the gloss for that synset. This list represents the context of the synset in WORDNET
and will be used to estimate the probability of a domain in the corpus given the synset. This
probability is collected in a vector, called Reuters Vector, with one dimension for each domain. The value
of each dimension is the probability of the domain given the synset and it is calculated with the
following formula:

$$P(D \mid s) = \frac{P(s \mid D)\,P(D)}{\sum_{i} P(s \mid D_i)\,P(D_i)}$$

where D is a domain in the Reuters collection, s is a synset in WordNet and the sum runs over the
selected domains.

The probability of the synset for a domain is assumed to be conditioned by the probability of its most
related lemmas (i.e. those in the Relevant Lemma List) and is calculated with the following formula:

$$P(s \mid D) = P(l_1, \ldots, l_m \mid D) = \prod_{i=1}^{m} P(l_i \mid D)$$

where l_i is a lemma in the Relevant Lemma List and m is the number of lemmas in the list.

The probability of a lemma given a domain is its relative frequency in that domain and it is calculated
with the following formula:

$$P(l \mid D) = \frac{c(l, D) + 1}{c(D) + |L|}$$

where c(l, D) is the number of occurrences of the lemma l in the domain D, c(D) is the total number of
word occurrences in the domain and |L| is the total number of the lemmas in the considered corpus.
For each domain, P(D) is assumed to be 1, because we have no special domain requirements to fit.
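
The computation of a Reuters Vector can be sketched as follows. This is only an illustration under stated assumptions: the lemma probability is smoothed by adding one to the count, the prior P(D) is the same for every domain, the scores are normalised over the selected domains, and the counts are invented rather than taken from the Reuters corpus. Logarithms are used because products of many small probabilities underflow quickly; that is a design choice of the sketch, not something the paper describes.

```python
import math


def lemma_log_prob(lemma, domain, counts, totals, vocab_size):
    """log P(l | D), assuming add-one smoothing: (c(l, D) + 1) / (c(D) + |L|)."""
    c_ld = counts[domain].get(lemma, 0)
    return math.log((c_ld + 1) / (totals[domain] + vocab_size))


def reuters_vector(lemmas, counts, totals, vocab_size):
    """Return {domain: P(D | s)} for a Relevant Lemma List, with equal priors."""
    log_scores = {
        d: sum(lemma_log_prob(l, d, counts, totals, vocab_size) for l in lemmas)
        for d in counts
    }
    # Subtract the maximum before exponentiating to limit underflow.
    top = max(log_scores.values())
    weights = {d: math.exp(v - top) for d, v in log_scores.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


# Invented toy counts (hypothetical, for illustration only).
counts = {
    "Sport": {"baseball": 40, "game": 60, "bat": 10},
    "Law": {"game": 5, "court": 50},
}
totals = {"Sport": 1000, "Law": 1000}  # c(D) for each domain
vector = reuters_vector(["baseball", "game", "bat"], counts, totals, vocab_size=500)
print(vector)
```

With these toy counts almost all of the probability mass falls on Sport, mirroring the shape of the vectors reported in Section 4.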


3.4. Matching with Manual annotation 

In addition to the Reuters Vector, for each synset we have built a vector, called WordNet Vector, with
five dimensions, one for each selected domain; this was simply done by scoring 1 for the domain(s)
with which the synset is annotated and 0 for all the other domains.

At this point we have, for each synset, a WordNet Vector and a Reuters Vector. These vectors are
first normalised, then the scalar product between them is calculated. We adopt this measure as
an index of the Proximity Score between the two sources of domain information. This measure ranges
from 0 to 1 and it is indicative of the similarity between the two annotations.

The score for each dimension of the Reuters Vector (i.e. the probability of being in a given
domain if a synset occurs) is interpreted as the relevance of a synset in a domain. Note that just five
domains were selected for the experiment and a closed world is assumed.
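
The Proximity Score can be sketched as the scalar product of the two vectors after normalisation. The sketch assumes that "normalised" means scaled to unit Euclidean length, which the paper does not state explicitly; the function names are hypothetical. The example values are those of Table 6 for {canonic, canonical}, with the very small Art value taken as 0.

```python
import math

DOMAINS = ["Law", "Art", "Religion", "Sport", "Military"]


def wordnet_vector(annotated):
    """1 for each manually annotated domain, 0 for the others."""
    return [1.0 if d in annotated else 0.0 for d in DOMAINS]


def proximity_score(u, v):
    """Scalar product of u and v after scaling both to unit length
    (assumed interpretation of 'normalised')."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Reuters Vector for {canonic, canonical}; Art is negligible and set to 0.
reuters = [0.41, 0.0, 0.56, 0.004, 0.02]
score = proximity_score(wordnet_vector({"Religion", "Law"}), reuters)
print(round(score, 2))  # prints 0.99
```

Under this reading the score for the canonic#2 example rounds to 0.99, the value reported in Section 4.2, which suggests that the interpretation is at least compatible with the published figures.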

4. Results and Discussion 

In WORDNET Domains there are 11,993 synsets with at least one manual annotation belonging to the 
selected domain set: 2094 for RELIGION, 2878 for ART, 4250 for SPORT, 1376 for LAW and 1395 for 
Military. 

In the rest of this Section we discuss the application of the acquisition and comparison methodology 
to some relevant cases. 

4.1. Synsets with Unique Manual Annotation 

In this experiment two restrictions have been applied to the initial set of 11,993 synsets: (i) a synset
must have at least one word among its synonyms occurring at least once in the Reuters Corpus
and (ii) it must have just one domain annotation in WORDNET DOMAINS. This selection produced an
experimental set of 867 synsets. As expected, the average value of Proximity Score obtained after the
comparison of the Reuters and WordNet vectors for this subset of synsets was very high (i.e. 0.96),
indicating that this subset of synsets is very relevant for the selected domains.

As an example, consider the synset {baseball, baseball game, ball game - (a game played with a bat
and ball between two teams of 9 players; teams take turns at bat trying to score runs)}, which was
manually annotated with the SPORT domain. Its WordNet Vector shows 1 in SPORT and 0 elsewhere.
Applying the acquisition procedure described in Section 3 over the corpus with the Relevant Lemma
List extracted from the synset, the resulting Reuters Vector for the synset is reported in Table 3. The
high score of SPORT with respect to the other domains indicates a strong collocation of the synset in the
Reuters corpus, which is also confirmed by the Proximity Score (i.e. 1) resulting from the comparison
with the manual annotation of the synset.


Law           Art           Religion      Sport   Military
1.82e-[...]   2.44e-[...]   1.71e-[...]   1       2.45e-63

Table 3: Reuters Vector for {baseball, baseball game, ball game}


Similar results are obtained for words whose senses belong to different domains among the five we
have selected. This is the case, for instance, of the lemma "icon", which has three senses in WordNet
1.6. Icon#1, (Computer Science) a graphic symbol (...) that denotes a program..., was marked by
WORDNET DOMAINS annotators as COMPUTER SCIENCE and thus was not used for the experiment.
Icon#2 {picture, image, icon, ikon - (a visual representation of an object or scene...)} was annotated
with ART and Icon#3 {icon, ikon - (a conventional religious picture...)} was annotated with
RELIGION.

The Reuters Vectors obtained for Icon#2 and Icon#3 are represented in Tables 4 and 5. While the
annotated domains are different, in both cases the Proximity Score is 0.99.


Law            Art    Religion   Sport         Military
1.52e-[...]8   0.99   7.28e-05   4.09e-[...]   3.87e-08

Table 4: Reuters Vector for {picture, image, icon, ikon}


Law      Art       Religion   Sport      Military
0.0005   4.2e-45   0.99       1.92e-11   0.0006

Table 5: Reuters Vector for {icon, ikon}


4.2. Synsets with Multiple Manual annotation 

A number of synsets of the initial set are annotated with multiple domain labels in WORDNET 
Domains. This is the case of the adjective canonic#2: {canonic, canonical - (of or relating to or 
required by canon law)}, which is annotated with two labels, RELIGION and LAW. The corresponding 
Relevant Lemma List is the following: canonic#a canonical relate#v required canon#n law#n. The 
Reuters Vector for canonic#2 is shown in Table 6. We can see that the most relevant domains are 
Religion and Law, with a Proximity Score of 0.99. 


Law    Art           Religion   Sport   Military
0.41   9.48e-[...]   0.56       0.004   0.02

Table 6: Reuters Vector for {canonic, canonical}


4.3. Factotum annotations 

Factotum synsets do not belong to any specific domain, and should have high frequency in all the 
Reuters texts. For instance, the synset containing the verb "to be" in its first sense, {be -- (have the 
quality of being)}, corresponds to the Reuters Vector represented in Table 7, which manifests a very 
high Proximity Score (i.e. 0.99) with respect to the WORDNET vector. 


Law    Art    Religion   Sport   Military
0.21   0.29   0.20       0.16    0.20

Table 7: Reuters Vector for {be}


4.4. Mismatching Annotations 


For some synsets the two vectors produce different classifications. For instance, the synset {wrath,
anger, ire, ira -- (belligerence aroused by a real or supposed wrong (personified as one of the deadly
sins))} is annotated with RELIGION, inherited from its hypernym {mortal sin, deadly sin}. However,
the preferred domain acquired from the Reuters Vector, shown in Table 8, is MILITARY. This is
mainly due to the fact that many lemmas in this synset pertain to the MILITARY domain. The only
lemma that pushes the choice towards the RELIGION domain is "deadly_sin", which is very rare in the
Reuters corpus.


Law           Art       Religion     Sport         Military
[illegible]   3.5e-44   5.2e-[...]   3.7e-4[...]   1

Table 8: Reuters Vector for {wrath, anger, ire, ira}


4.5. Covering Problems 

For some synsets the Relevant Lemma List is not sufficiently covered in the Reuters corpus to produce a
significant domain classification. This is the case, for instance, of the synset {Loki -- (trickster; god of
discord and mischief; contrived death of Balder and was overcome by Thor)}, manually annotated
with RELIGION, due also to its hypernym {deity, divinity, god, immortal}. Its Reuters Vector is shown
in Table 9.


Law        Art         Religion   Sport         Military
2.10e-44   1.45e-131   2.63e-13   6.78e-[...]   1

Table 9: Reuters Vector for {Loki}


The preferred domain, MILITARY, depends both on the absence in the corpus of some lemmas (i.e. 
Loki, Balder, Thor) and on the presence of terms strongly related to the MILITARY domain (i.e. 
discord, death, overcome). 

5. Related Works 

The importance of domain information in relation to WordNet has been noted in several works
in recent years.

(Gonzalo et al., 1998) emphasizes the role of domains for WSD. Following this line, (Magnini and 
Strapparava, 2001b) introduced "Word Domain Disambiguation" (WDD) as a variant of WSD where 
for each word in a text a domain label (among those allowed by the word) has to be chosen instead of 
a sense label. We also argued that WDD can be applied to disambiguation tasks that do not require 
fine-grained sense distinctions, such as information retrieval and content based user modeling. For 
example, (Magnini and Strapparava, 2001a) describes the use of a content-based document 
representation as a starting point to build a model of the user's interests. 

A closely related work is that of (Buitelaar and Sacaleanu, 2001), which describes a method for
determining the relevance of GermaNet synsets with respect to a specific domain. A case study on
three domains (i.e. BUSINESS, SOCCER and MEDICAL) is reported, where three different corpora have
been used. Term Relevance of a synset with respect to a domain is calculated by summing up term
relevances for words in the synset and in its hyponyms (with a penalty for missing hyponyms).
Finally, a methodology for the integration of domain specific information into generic synsets is
suggested in (Vossen, 2001).


6. Conclusions 


In this paper we have presented a lexical resource, WORDNET DOMAINS, where synsets have been 
annotated with domain labels following lexico-semantic criteria. A preliminary investigation has been 
presented aiming at comparing the lexico-semantic annotation with a corpus-based annotation. The 
goal was twofold: on the one hand to provide an evaluation of the manual annotation based on 
distribution of domain information in a large corpus; on the other hand to integrate probabilistic 
information into WORDNET DOMAINS. 

A procedure for the automatic acquisition of domain information from domain-annotated corpora, as
well as the results of an experiment over a subset of the Reuters corpus, have been described. Results
show that a high degree of matching between ontology-based and corpus-based annotations can be
reached for a limited number of domain relevant synsets. The lower degree of correspondence reached
for some synsets is due either to limitations of coverage (i.e. words in WordNet not present in the
corpus) or to different interpretations of the synset (e.g. specialistic versus generic interpretation).

On the other hand we have noticed that it is often difficult to find a mapping between ontology-based
and corpus-based annotations, and it probably does not make sense to seek such an accurate
correspondence. Indeed, ontology-based and corpus-based domain annotations could play
complementary roles, and future developments should focus on exploiting this distinction, for
example, in disambiguation or categorization algorithms.

We consider the work presented here as a first step in the direction of a full and automatic procedure 
for the acquisition of domain information from corpora. For the future we plan to collect and use large 
and diverse domain-annotated corpora, the long term goal of this research being the integration of 
corpus-based domain information within the WordNet taxonomy. At the same time we will continue 
exploring the role of domain information in relation to the use of WORDNET, mainly for word sense 
disambiguation tasks. 

References 

P. Buitelaar and B. Sacaleanu. (2001) Ranking and selecting synsets by domain relevance. In Proc. of 
NAACL Workshop WORDNET and Other Lexical Resources: Applications, Extensions and 
Customizations, Pittsburgh, June, held in conjunction with NAACL2001. 

A. Diekema. (1998) Dewey decimal classification. DDC. 

C. Fellbaum. (1998) WordNet. An Electronic Lexical Database. The MIT Press. 

J. Gonzalo, F. Verdejo, C. Peters, and N. Calzolari. (1998) Applying EuroWordNet to cross-language
text retrieval. Computers and the Humanities, 32(2-3): 185-207.

B. Magnini and G. Cavaglia. (2000) Integrating subject field codes into WordNet. In Proceedings of 
LREC2000, Second International Conference on Language Resources and Evaluation, Athens, 
Greece, June. 

B. Magnini and C. Strapparava. (2001a) Improving user modelling with content-based techniques. In 
UM2001 User Modeling: Proc. of 8th International Conference on User Modeling (UM2001), 
Sonthofen (Germany), July. Springer Verlag. 

B. Magnini and C. Strapparava. (2001b) Using WORDNET to improve user modelling in a web 
document recommender system. In Proc. of NAACL Workshop WORDNET and Other Lexical 
Resources: Applications, Extensions and Customizations, Pittsburgh, June, held in conjunction with 
NAACL2001. 

B. Magnini, C. Strapparava, G. Pezzulo, and A. Gliozzo. (2001) Using domain information for word 
sense disambiguation. In Proc. of SENSEVAL-2. to appear. 


Reuters. (1997) http://about.reuters.com/researchandstandards/corpus/. 

H. Schmid. (1994) Probabilistic part-of-speech tagging using decision trees. In Proceedings of 
International Conference on New Methods in Language Processing. 

P. Vossen. (2001) Extending, trimming and fusing WORDNET for technical documents. In Proc. of 
NAACL Workshop WORDNET and Other Lexical Resources: Applications, Extensions and 
Customizations, Pittsburgh, June, held in conjunction with NAACL2001. 

Bernardo Magnini, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, I-38050, Trento, Italy, magnini@itc.it
Carlo Strapparava, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, I-38050, Trento, Italy, strappa@itc.it
Giovanni Pezzulo, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, I-38050, Trento, Italy, pezzulo@itc.it
Alfio Gliozzo, ITC-irst, Istituto per la Ricerca Scientifica e Tecnologica, I-38050, Trento, Italy, gliozzo@itc.it




Induction of Classification from Lexicon Expansion : 
Assigning Domain Tags to WordNet Entries 


Echo Chang 
Chu-Ren Huang 
Sue-Jin Ker 
Chang-Hua Yang 


Abstract 

The goal of this paper is to present a series of induction-based methods for assigning domain tags to WordNet
entries. Our prime objective is to enrich the contextual information in WordNet specific to each synset
entry. By using available lexical sources such as the Far East Dictionary and the contextual
information in WordNet itself, we establish a foundation on which to base our categorization.
Next we examine the similarity between a common lexical taxonomy and the semantic hierarchy
of WordNet. Based on this observation and on knowledge of other semantic relations, we enlarge the
coverage of our findings in a systematic way. The resulting accuracy is found to be
satisfactory.

Introduction 

WordNet encodes a lexicon comprising nouns, verbs, adjectives and adverbs. The basic organization
is based on different semantic relations among the words. Words with the same meaning are grouped
into a synset, and in our database every term is assigned a unique sense identification number
for easy retrieval and tracking. This unique offset number gives information about the
part of speech and the hierarchy position to which a specific synset belongs. For nouns and verbs the
synsets are grouped into multiple lexical hierarchies; modifiers such as adjectives and adverbs are
simply “organized into clusters on the basis of binary opposition (antonymy).” [1] This lexical
hierarchy makes the task of assigning lexical domains more straightforward because it coincides with an
ontological taxonomy in many respects. The primary objective of our project is to enrich the
knowledge content of WordNet, given that “WordNet lacks relations between related concepts.” [2]
We adopt WordNet itself, together with other lexical resources, to develop an integrated
domain-specific lexical resource.

1. The Five Tagging Methods 

We follow the five domain tagging methods listed below. We first employ the existing data resources
directly from the Far East Dictionary and WordNet. Later the lexical hierarchy is used to enlarge
the coverage of domain assignment.

1.1. Domain Data from Far East Dictionary

The digital file of the Far East Dictionary contains the complete information for each word entry that can be
found in an ordinary printed version. Most importantly, it lists the domain information for certain vocabulary
wherever possible. Thus we take the available data from a text source file (each vocabulary entry is
organized as one single row) and extract all the information by running a string manipulation program
coded in Visual Basic. During the extraction process we only take into account the part of speech of
each word in the Far East Dictionary. Next, we map the domains obtained from the Far East Dictionary if the
word and its part of speech coincide with the entries in our database, which contains a complete list of
synsets. Since WordNet organizes words solely into nouns, verbs, adjectives and adverbs,
we only extract the domain data that fall into these four categories. Later we group the information in
a database table and extend the assigned domains of each word to its synset. Table 1 is an example of
our database table which contains all the 



ID          Term    Domain
00073303R   aback   aviation
00073386R   aback   aviation

Table 1: Example of The Far East Dictionary Domain Database Table

In Table 1, it is evident that “aback” has two adverb senses. Since in the Far East Dictionary “aback”
is labeled with the domain “aviation,” an extra expansion step is necessary to label all of its
adverb synsets with the same domain, to maintain the integrity of the information. Because both the
extraction and the expansion method can produce ambiguities in domain assignment, manual
verification will be required in the future.
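
The mapping and expansion steps can be sketched as follows. The original program was written in Visual Basic; Python is used here only for illustration. The data structures, the function tag_synsets and the synset member "by surprise" are hypothetical, and the sketch assumes that the last character of a sense ID encodes the part of speech, as in the IDs of Table 1.

```python
from collections import defaultdict

# Hypothetical inputs. Dictionary domains are keyed by (word, part of speech);
# synsets map a sense ID (offset plus POS letter) to its member words.
dictionary_domains = {("aback", "R"): ["aviation"]}
synsets = {
    "00073303R": ["aback"],
    "00073386R": ["aback", "by surprise"],  # second member is invented
}


def tag_synsets(dictionary_domains, synsets):
    """Copy each dictionary domain to every synset that contains the word
    with the same part of speech, then to every member of that synset."""
    tagged = defaultdict(set)  # (sense_id, term) -> set of domains
    for sense_id, terms in synsets.items():
        pos = sense_id[-1]  # assumed: last character encodes the POS
        domains = set()
        for term in terms:
            domains.update(dictionary_domains.get((term, pos), []))
        if domains:
            for term in terms:  # expansion to the whole synset
                tagged[(sense_id, term)] |= domains
    return dict(tagged)


rows = tag_synsets(dictionary_domains, synsets)
for (sense_id, term), domains in sorted(rows.items()):
    print(sense_id, term, ", ".join(sorted(domains)))
```

The sketch also shows where ambiguity enters: every adverb sense of “aback” receives the label, whether or not that particular sense is the aviation one, which is why manual verification is needed.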

1.2. DOMAIN INFORMATION FROM WORDNET

Each WordNet entry (i.e. each synset) is followed by its sense. For specific terms that are commonly
used in a specific field of study, WordNet provides contextual information such as biology,
physics or chemistry. This specification comes in a special format, enclosed in brackets, for each
WordNet entry, so that extraction of the data is possible and straightforward. Because each
domain is extracted directly from its corresponding synset, there is no ambiguity in assigning the
domain tags.
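
A minimal sketch of this extraction is shown below, assuming that the domain appears as a parenthesised label at the very start of the gloss. The exact layout of the data files used by the authors is not given, so the pattern and the function split_domain are assumptions.

```python
import re

# A leading parenthesised label, e.g. "(physics) ...". The gloss layout is
# an assumption; the pattern may need adjusting for the actual data files.
LABEL = re.compile(r"^\s*\(([^()]+)\)\s*(.*)$")


def split_domain(gloss):
    """Return (domain, rest_of_gloss); domain is None when no label leads."""
    match = LABEL.match(gloss)
    if match is None:
        return None, gloss.strip()
    return match.group(1).strip().lower(), match.group(2)


print(split_domain("(computer science) a graphic symbol that denotes a program"))
print(split_domain("day of the month"))
```

A leading parenthesis does not always introduce a field of study, so a practical version would probably check the extracted label against the list of accepted domains before tagging.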

1.3. COMMONALITY BETWEEN THE LEXICAL HIERARCHY AND TAXONOMY FOR NOUNS 

At first we collect the domain information present in the Far East Dictionary and in WordNet itself
and use it as our foundation to generate a basic domain list. Wherever we find it inadequate,
more domain categories are added to create a more complete set. After collecting all
the necessary data, we group the categories in a hierarchical manner, shown by indentation in the list below.
By induction and from actual examples of the lexical organization in WordNet, it is found that a
hyponym is very likely to belong to the same domain as its hypernym. For instance, under the term
“mathematics,” all the hyponyms are related to this field of study. Under the term
“mathematician,” there is a complete set of names of prominent mathematicians. To exploit this
phenomenon we first make a table of all the following domain terms and map them to their unique
sense identification numbers. Later we use the tree expansion method (discussed in more detail in
Section 1.4) to trace down all the hyponyms. For example, by using this method, the hyponyms of
Linguistics are all labeled as “linguistics” and so forth.


Humanity
    Linguistics
    Rhetorical Device
    Literature
    History
    Archeology

Social Science
    Sociology
    Statistics
    Economics
    Business
    Finance

Formal Science
    Mathematics
    Geometry
    Algebra

Natural Science
    Physics
    Nuclear
    Chemistry
    Biology
    Palaeontology
    Botany
    Animal
    Fish
    Bird

Applied Science
    Medicine
    Anatomy
    Physiology
    Genetics
    Pharmacy
    Agriculture

Fine Arts
    Painting
    Sculpture
    Architecture
    Music
    Drama

Entertainment
    Sports
    Balls
    Track & Field
    Competition
    Game
    Board
    Card

Proper Noun
    Name
    Geographical Name
    Country
    Religion
    Trademark

Humanity
    Archaic
    Informal
    Slang
    Metaphor
    Formal
    Abbreviate

Lexical Sources
    Latin
    Greece
    Spanish
    French
    American


1.4. Lexical Hierarchy Expansion of Nouns 

In WordNet the nouns are linked together in the form of a hierarchy. Based on this knowledge we
start assigning domains to the WordNet entries by observing the tree structure of the lexical hierarchy,
expanded computationally from the first level to the fifth. In our database there is a relational
table, as shown in Table 2.


Hypernym ID   Hyponym ID   Relation
00001740A     04349777N    =
00001740A     00002062A    !
00001740N     00002086N    ~
...           ...          ...

Table 2: Lexical Relation Table


The relation symbols in Table 2 conform to the format in the WordNet database files. These
symbols are saved with each synset entry to indicate a specific semantic relation with another synset.
This information allows us to trace and locate all the related synsets.


Relation    Symbol
Antonym     !
Hyponym     ~
Hypernym    @
Meronym     #
Holonym     %
Attribute   =

Table 3: Relations and Pointer Symbols


By manipulating Table 2 with SQL, the eight top starting points of the nouns can be obtained. These
are the points from which the nouns originate in the WordNet hierarchy. The results are as follows:



Entity, something
Abstraction
Act, human action, human activity
State
Event
Group, grouping
Phenomenon
Possession

Table 4: The Eight Unique Beginners for Nouns
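
One way to obtain the unique beginners with SQL is sketched below, using Python's built-in sqlite3 module as a stand-in for the authors' database, which is not named in the paper. The table name, the column names and the rows are invented; the query simply selects noun IDs that have hyponyms but are never themselves the hyponym of another synset.

```python
import sqlite3

# Toy relation table; IDs and rows are invented for illustration.
rows = [
    ("00000001N", "00000002N", "~"),  # hypernym -> hyponym
    ("00000001N", "00000003N", "~"),
    ("00000002N", "00000004N", "~"),
    ("00000005N", "00000006N", "~"),
    ("00000007A", "00000008A", "!"),  # antonym pair, ignored below
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relation (hypernym_id TEXT, hyponym_id TEXT, symbol TEXT)"
)
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", rows)

# A unique beginner has hyponyms but is never the hyponym of another noun.
query = """
    SELECT DISTINCT hypernym_id
    FROM relation
    WHERE symbol = '~'
      AND hypernym_id LIKE '%N'
      AND hypernym_id NOT IN (
          SELECT hyponym_id FROM relation WHERE symbol = '~'
      )
    ORDER BY hypernym_id
"""
roots = [row[0] for row in conn.execute(query)]
conn.close()
print(roots)
```

In the toy data the two roots are 00000001N and 00000005N; run over the full relation table, the same kind of query would be expected to return the beginners listed in Table 4.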


The general structure of tree expansion can be visualized as in Figure 1.

Figure 1: Example of Tree Expansion for Nouns


This form of data presentation makes inspection and observation of the hierarchy among nouns more
straightforward. After careful and systematic examination, a proper domain is assigned to each synset
level by level. The same task is performed down to the fifth level. Having reached this depth in the
hierarchy, we have more confidence that the hyponyms placed below are very likely to
belong to the same domain as the hypernyms placed above. Thus a tree traversal program is executed
to trace down the hyponyms. In turn, the synsets positioned in the lower part of the hierarchy are
assigned the same domain as their hypernyms.
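
The traversal step can be sketched as a breadth-first walk over the hyponym links. The function expand_domains, the dictionary layout and the toy hierarchy are hypothetical; the sketch also assumes that a tag assigned by inspection is never overwritten by one inherited from above.

```python
from collections import deque


def expand_domains(hyponyms, seeds):
    """Push each seed domain down to all hyponyms that carry no tag yet.

    hyponyms: {synset: [direct hyponym, ...]}
    seeds:    {synset: domain} assigned by inspection in the upper levels
    """
    tags = dict(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for child in hyponyms.get(node, []):
            if child not in tags:  # keep tags set by inspection
                tags[child] = tags[node]
                queue.append(child)
    return tags


# Invented toy hierarchy; the real WordNet structure differs.
hyponyms = {
    "science": ["mathematics", "linguistics"],
    "mathematics": ["geometry", "algebra"],
    "geometry": ["plane geometry"],
}
seeds = {"mathematics": "mathematics", "linguistics": "linguistics"}
tags = expand_domains(hyponyms, seeds)
print(tags)
```

Here "plane geometry" ends up tagged as mathematics through two levels of inheritance, while "science", which lies above the seeds, stays untagged.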

1.5. Relational Expansion of Other Parts of Speech


The hierarchy expansion method based on taxonomical inspection mainly applies to nouns. For
modifiers such as adjectives and adverbs this general observation does not produce a satisfactory
result. “The semantic organization of modifiers in WordNet is unlike... the tree structures created by
hyponymy for nouns or troponymy for verbs.” [1] However, there is other information implemented in
WordNet which can be adopted for further classification of the modifiers. For example, the
adjective “stellar” is derived directly from the noun “star.” The term “star” is mostly mentioned in an
astronomical context. Since “star” is labeled “astronomy” by the method of Section 1.4 (Lexical
Hierarchy Expansion of Nouns), the adjective “stellar” is best assigned the same domain. We
join the table of domain-tagged nouns and the table of untagged adjectives and adverbs through
Table 2 (the Lexical Relation Table) to obtain a table organized as follows:


[Domain-Tagged Noun ID]   [Lexical Relation]   [Un-tagged Adj/Adv ID]

Figure 2: JOIN Method


Later the records that have the relation symbol “\” (denoting “derived from”; refer to Table 2) are
extracted, and these adjectives and adverbs derived from nouns are assigned the same domain
as the nouns they are derived from.
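
The JOIN method can be sketched with Python's built-in sqlite3 module. The table names, column names and IDs are invented, and only the stellar/star pairing comes from the text; the sketch further assumes that each relation row stores the derived adjective or adverb as its source and the noun as its target.

```python
import sqlite3

DERIVED_FROM = "\\"  # the single-character pointer symbol "\"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE noun_domain (noun_id TEXT, domain TEXT);
    CREATE TABLE relation (source_id TEXT, target_id TEXT, symbol TEXT);
""")
# Invented IDs: 00000010N stands for "star", 00000020A for "stellar".
conn.execute("INSERT INTO noun_domain VALUES ('00000010N', 'astronomy')")
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?)",
    [
        ("00000020A", "00000010N", DERIVED_FROM),  # stellar derived from star
        ("00000021A", "00000022A", "!"),           # unrelated antonym pair
    ],
)

# Join tagged nouns to untagged modifiers through the "derived from" pointer.
query = """
    SELECT r.source_id, d.domain
    FROM relation AS r
    JOIN noun_domain AS d ON d.noun_id = r.target_id
    WHERE r.symbol = ?
"""
inherited = conn.execute(query, (DERIVED_FROM,)).fetchall()
conn.close()
print(inherited)
```

The join returns one row, pairing the adjective ID with “astronomy”; the antonym row is filtered out by the symbol condition.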

Results

There are 99,642 unique senses organized by WordNet. By pairing each word
with each of its senses, the number of unique “word & sense” pairs totals
173,941, which is the basis for all the results.


Parts of Speech   Percentage in Total
Noun              66.87 %
Adjective         17.18 %
Verb              12.69 %
Adverb            3.27 %

Table 5: Percentage of Each Part of Speech in The 173,941 “Word & Sense” Pair Entries

1.6. Far East Dictionary 


There are 20,126 senses that have been assigned a domain tag from the Far East Dictionary, which
account for 20.20 % of all senses (99,642 senses in total in WordNet). However, after expanding
each tagged sense to its synset, 42,643 “word & sense” pairs are tagged, which account for
24.52 % of the 173,941 pairs in total.


Parts of Speech    Number Tagged    Percentage in Total
Noun               29,946           17.22 %
Adjective          [illegible]      3.56 %
Verb               [illegible]      3.54 %
Adverb             349              0.20 %


Table 6: Tagging Percentage with Far East Dictionary


Figure 3: Tagging Percentage with Far East Dictionary (chart of tagging percentage by part of
speech: Noun, Adjective, Verb, Adverb)


1.7. Information Provided by WordNet

The tagging percentage obtained by extracting information directly from WordNet is as follows:


Parts of Speech    Number Tagged    Percentage in Total
Noun               1,826            1.050 %
Adjective          [illegible]      0.863 %
Verb               2                0.001 %
Adverb             109              0.063 %


Table 7: Tagging Percentage with WordNet


Figure 4: Tagging Percentage with WordNet (chart of tagging percentage by part of speech)


1.8. Tagging by The Taxonomical Structure 

The result of the Taxonomical Lexicon Structure Expansion method is as follows:


Number of Senses Tagged (single level)        458
Senses Tagged After Tree Expansion            [illegible]
Word & Sense Pairs After Tree Expansion       41,770
Percentage In Terms of Total 173,941 Pairs    24.01 %


Table 8: Tagging Percentage By Taxonomy


1.9. Tagging by inspection of the Lexical hierarchy of Nouns 

We examine the meaning of each synset and label its domain by inspection. We first examine the
second level of the hierarchy, label the recognizable domains, and leave out the ambiguous ones.
Next we expand to the third level and label the domains. The same procedure is iterated until the
hierarchy is expanded to the fifth level. The following is the number of senses that are tagged by
inspection and by tree expansion. The total number of distinct word & sense pairs tagged using the
3.4 Taxonomical Method and the 3.5 Hierarchical Method is 88,971, which accounts for 51.15 % of the total.
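
The inspect-then-expand procedure can be sketched as a top-down propagation over the hyponym tree. This is a minimal sketch, not the authors' implementation: the function name, the rule that an inspected node keeps its own label, and the toy hierarchy are all assumptions made for illustration.

```python
def expand_labels(hyponyms, inspected):
    """Propagate each label assigned by inspection down to every
    unlabelled descendant. A descendant that was itself labelled by
    inspection keeps its own label and is expanded separately."""
    labels = dict(inspected)

    def visit(node, domain):
        for child in hyponyms.get(node, []):
            if child in inspected:
                continue  # has its own inspected label
            labels.setdefault(child, domain)
            visit(child, domain)

    for node, domain in inspected.items():
        visit(node, domain)
    return labels


# Hypothetical toy hierarchy: synset -> list of hyponyms.
hyponyms = {
    "entity": ["organism", "artifact"],
    "organism": ["plant", "animal"],
    "plant": ["tree"],
    "artifact": ["telescope"],
}
# Labels assigned by inspection; ambiguous nodes are simply left out.
inspected = {"plant": "botany", "animal": "zoology"}

labels = expand_labels(hyponyms, inspected)
print(labels)
```

Nodes above or beside the inspected ones ("entity", "artifact", "telescope" here) stay untagged, which corresponds to leaving ambiguous synsets out at each level.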


Method       [illegible]    [illegible]
2nd Level    6              9-1
3rd Level    292            12,544
4th Level    1,171          [illegible]
5th Level    373            [illegible]


Table 9: Tagging Percentage By Hierarchy (in the total of 99,642 senses)

After mapping each sense to all the words in its synset, the result is as follows:


Method       Sense Tagged After Expansion    Percentage In Terms of Total 173,941 Pairs
2nd Level    144                             0.08 %
3rd Level    22,478                          12.92 %
4th Level    51,607                          29.67 %
5th Level    9,707                           5.58 %


Table 10: Tagging Percentage By Hierarchy (in the total of 173,941 pairs)


Figure 5: Tagging Percentage By Hierarchy (in the total of 173,941 pairs). Chart of percentage
tagged (0.00 % to 35.00 %) against level.



1.10. Relational Expansion of the Modifiers 


First we map Table 3 (Relations and Pointer Symbols) onto the 88,971 entries (51.15 % of the total)
produced with Methods 2.3 and 2.4. Next we extract the rows that contain the symbol “\”, which
denotes “derived from,” to extend the domain tags from nouns to the modifiers, i.e. the adjectives
and adverbs. The result is as follows:


Sense Entries with [illegible]             2,625
Expansion to Unique Word & Sense Pairs     3,452
Percentage In Total of 173,941 Pairs       1.98 %


Table 11: Tagging Percentage of Relational Expansion


TESTING AND DISCUSSION 


The principal testing method is to select 200 “word & sense” pairs at random from the pool of
results produced by each method. Method 2.3 is combined with Method 2.4; together they are called
the tree expansion method in the following analysis. Table 12 shows that the 2.2 Information from
WordNet method has the greatest accuracy, the 2.1 Far East Dictionary method ranks second, the
2.3 & 2.4 Tree Expansion method third, and the 2.5 Derivation method last.
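
The sampling and tallying procedure can be sketched as below. This is an illustration only: the function names are hypothetical, the ratings themselves come from human judgement and are not computed, and the example counts are merely chosen to reproduce the Far East column of Table 12 from a sample of 200.

```python
import random
from collections import Counter

LABELS = ("Wrong", "Acceptable", "Accurate")


def sample_pairs(pairs, k=200, seed=0):
    """Draw up to k distinct pairs at random (seeded for repeatability)."""
    rng = random.Random(seed)
    pool = sorted(pairs)
    return rng.sample(pool, min(k, len(pool)))


def rating_shares(ratings):
    """Turn a list of manual ratings into percentages per label."""
    counts = Counter(ratings)
    return {label: 100.0 * counts[label] / len(ratings) for label in LABELS}


# Hypothetical pool of tagged (word, sense) pairs for one method.
pool = {(f"word{i}", i) for i in range(500)}
sample = sample_pairs(pool)

# Illustrative ratings: 36 wrong, 22 acceptable, 142 accurate out of 200.
ratings = ["Wrong"] * 36 + ["Acceptable"] * 22 + ["Accurate"] * 142
shares = rating_shares(ratings)
print(len(sample), shares)
```

With a fixed sample of 200, each rated pair moves a share by 0.5 percentage points, which matches the granularity of the figures in Table 12.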



              Far East Method    WordNet    Tree Expansion    Derivation
Wrong         18.00 %            2.00 %     27.00 %           24.00 %
Acceptable    11.00 %            5.50 %     [illegible]       34.00 %
Accurate      71.00 %            92.50 %    66.00 %           42.00 %


Table 12: The Accuracy Rating of the Four Methods



Figure 6: Accuracy vs. Methods (chart; vertical axis from 10.00 % to 100.00 %, horizontal axis: methods)


Far East    WordNet    Tree Expansion    Derivation
24.52 %     1.98 %     51.15 %           1.98 %


Figure 7: Tagging Percentage vs. Methods


As shown in Figure 7, the tagged entries may overlap. In terms of accuracy, the 2.2 WordNet method
should be considered the best approach, with 92.50 % accuracy. This method of extracting information
directly from WordNet does not attain 100 % because only certain words in a synset are used in
specialized areas of study. For example, in botany a number of terms may denote the same species,
but only certain of them are the actual scientific names while the rest are merely common names.
In our project, the primary objective is to favour the words that belong to the specific area of
study, which is also the main concept upon which our lexical taxonomy is organized.

Based on the extent of domain assignment and the number of entries covered, Tree Expansion is the
most effective method, with 51.15 % coverage. The WordNet and Tree Expansion methods each have
their own advantages and disadvantages in terms of time consumption and extent of coverage. With
the WordNet method, extracting data directly from digital sources is very efficient and the result is
more reliable. Given its high accuracy, later revision would be more straightforward. In terms of
extent of coverage, however, Tree Expansion is still the more effective method. Its result is very
encouraging because it contributes over 51 % of the entire domain assignment, with a total correct
or acceptable rate of 74 %. It is worth noting, however, that not every entry in WordNet is supposed
to be grouped or defined within a specific domain. For instance, the common grammatical words
(a, the, is, etc.) and high-frequency words (hit, kick, smile, etc.) would not and should not belong
to a special domain. Although we do not have a realistic measure for recall, the slightly less than
49% coverage of all senses is quite acceptable. So far the number of distinct entries that have been
tagged is 103,709, which covers 59.62 % of all 173,941 word and sense pairs.

2. Future Goals and Improvements 

At present our domain tag assignment is still at a preliminary stage and requires further
modification and improvement. Other methods, such as bottom-up tree traversal, are likely to give
better results with higher accuracy. For example, for a hyponym that falls into the domain of
botany, the hypernym is very likely to belong to the domain “biology.” Extracting terms from large
corpora grouped by topic is also a reliable approach. For instance, in a physics journal, most
field-specific terms are likely to appear more frequently than in ordinary sources. Besides
extracting information from WordNet itself, other thesauri available as digital files can be taken
into consideration as well.
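
One possible reading of the bottom-up idea is sketched below. It is purely illustrative: the text does not say how a hypernym's domain would be chosen, so the majority vote, the `broader_domain` mapping and the function name are all assumptions.

```python
from collections import Counter

# Hypothetical mapping from a specific domain to a broader one.
broader_domain = {"botany": "biology", "zoology": "biology"}


def suggest_from_hyponyms(hyponym_domains):
    """Suggest a domain for a hypernym by majority vote over the broader
    domains of its already tagged hyponyms (None if there are no votes)."""
    votes = Counter(broader_domain.get(d, d) for d in hyponym_domains)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


suggestion = suggest_from_hyponyms(["botany", "botany", "zoology"])
print(suggestion)
```

A domain that has no broader entry in the mapping votes for itself, so a hypernym whose hyponyms all share one domain would simply inherit that domain.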


Domain tag assignment has a significant number of possible applications. Because the English
WordNet is the fundamental structure upon which wordnets in other languages are based, domain tags
assigned to WordNet can be extended to other inter-linked wordnets such as EuroWordNet.
Categorizing the lexicon into groups of different domains will also benefit the study of
linguistics: “word sense disambiguation methods could profit from these richer ontologies, and
improve word sense disambiguation performance.” [2]

References 

[1] Fellbaum, Christiane. (1998) WordNet: An Electronic Lexical Database. The MIT Press. Cambridge,
Massachusetts.

[2] Agirre, Eneko et al. (2001) Enriching WordNet concepts with topic signatures. Proceedings of the
NAACL workshop on WordNet and Other Lexical Resources: Applications, Extensions and
Customizations. Pittsburgh.

[3] Magnini, Bernardo and Cavaglià, Gabriela. (2000) Integrating Subject Field Codes into WordNet.
Proceedings of the LREC conference (Conference on Language Resources & Evaluation).

Echa Chang, University of Waterloo, 200 University Ave. W., Waterloo, ON N2L 3G1 Canada,
cecha@yahoo.com

Chu-Ren Huang, Institute of Linguistics, Academia Sinica, Nankang, Taipei, Taiwan 115, churen@sinica.edu.tw
Sue-Jin Ker, Soochow University, ksj@sun.cis.scu.edu.tw
Chang-Hua Yang, Soochow University, changhua@mail2000.com.tw


A Graphical Tool for Browsing, Searching, and Annotating WordNet 

Arthur Cater 


Abstract 

The paper reports on a tool for working with WordNet. The tool has four major components with
interacting features. Its “Tangle Browser” summarises the index information for an input word and
related multiwords, and gives one-click access to graphical mouse-sensitive representations of the
indexed synsets¹⁵. Using it, portions of the WordNet linked to a synset may be quickly apprehended.
The tool’s “Tree Browser” shows selected parts of the collections of noun or verb synsets, organised
as a tree based on the hypernym relation commonplace in those kinds of synset. The tool’s “Scanner”
allows synsets having virtually unrestricted combinations of properties to be identified. The tool’s
“Annotator” allows additional information to be recorded for synsets, in a systematic way whilst also
allowing freestyle commentary. The tool may be used with versions 1.5, 1.6 and 1.7 of WordNet.

1. Introduction 

WordNet is a freely available and widely used electronic dictionary, which contains information about 
approximately 100,000 English words: Fellbaum (1998) describes it and reports much work using it. 
It is unusual in that its entries are connected by links that represent a variety of semantic and lexical 
relations, so that the whole dictionary is a network of lexical entries. Software accompanies the 
dictionary, allowing one to interactively interrogate the dictionary, and to access dictionary entries 
under client program control. 

WordNet is not only a resource for use in Natural Language Processing applications. It is also an 
object of research in its own right, since it is necessary to understand its structure and to discover its 
content (and its limitations) before putting it to effective use and before setting out to modify or 
enhance it. The software provided for browsing the dictionary allows one to look up entries for word 
forms, and to follow links from one entry to another, but it does not support any particularly 
sophisticated access to, or analysis of, collections of entries. The perceived need for more advanced 
browsing and searching facilities is clearly not new, and may be gauged from the extensive set of such 
facilities listed on the WordNet website: WordNet (2001). 

This paper reports on a tool developed to support browsing, searching, and annotation of WordNet. It
has two kinds of browser: one that allows arbitrary links to be followed from entry to entry, and
another that emphasises the hierarchical hypernym/hyponym relationship that is characteristic of the
nouns and verbs, and the groupings of adjectives. These two browsers are sufficiently different to be
classed as separate components. The tool also has a search facility, referred to as “the Scanner”, which
allows one to identify and isolate entries with particular properties. In the tool there is also an
annotation facility, which allows for recording information about existing entries other than what
WordNet itself provides. These four major facilities are all accessed through graphical user interfaces,
and interact with one another. The entire tool is implemented in Macintosh Common Lisp: Digitool
(2001).

The four major components will be introduced individually in §2 through §5. Several minor facilities 
are also present but some are still at an early stage of development; they will be mentioned 
individually in passing only for the sake of completeness. In conclusion, §6 comments on the utility of 
the tool and describes plans for its future improvement. 


15 The term “synset” is a standard abbreviation for “synonym set”. Each distinguished sense of a word is grouped with nearly
synonymous senses of other words, and a variety of information is provided which generally applies to all grouped
word-senses. See Fellbaum (1998) for further details.



2. The Tangle Browser 


The Tangle Browser¹⁶ allows one to type a word and see a pictorial representation of its index entries,
and also those of multiwords starting with the typed word. With a click, one may view any or all of 
the synsets for senses of that word. With another click, one can follow any or all the WordNet 
pointers from a synset to other synsets; see which lexicographer file contains the synset definition; see 
the usage codes (of verbs); and access other component tools. 


Figure 1 shows a screenshot of the Tangle Browser in action (using WordNet v1.7).



Figure 1: Screenshot of the Tangle Browser 


• In the panel at top left the word “tangle” has been typed. Across from it are several fixed controls,
with mostly obvious functions of no particular interest here.

• Beneath it (in a separately scrollable area of the window) are summaries of the index entries for
“tangle” and for those multiwords with “tangle” as first constituent word. Each summary is
shown as an ensemble of panels: a word panel, beside it 3 pictorial panels explained later, below
it panels indicating numbers of senses of various parts of speech. We see that “tangle” has 2 noun
entries and 4 verb entries.

• The four verb entries have been clicked on, causing their synsets to be shown.

• For each synset we see the synonymous senses, the glosses, and other panels. Among these panels
are ones corresponding to WordNet pointers to other synsets. Lexical links (those relating
individual word to individual word) are visually distinguished from semantic links (those relating
all words to all words) by being on a grey background.

• Two links from the 4th verb sense of “tangle” have been followed, showing other synsets in the
same manner; pointers may be followed in this way indefinitely.

• The three verb usage codes given for the 2nd verb synset are shown. Likewise the description of its
lexicographer file (number 35) is shown.

• The entire gloss of the 3rd verb synset, normally truncated to fit a panel, is shown whilst the
mouse button is clicked on it and held. The gloss is decomposed into as many definitions and as
many example fragments as are present.

• The three objects at bottom have been dragged around manually to improve the layout.

The index summary ensembles have three attached icons. The first is a sketch of a graveyard
headstone; clicking it causes the ensemble to die (i.e. be erased from the display). The second is an
“exploder”; clicking it has the same effect as clicking all synset-summoning panels. The third is the
converse, an “imploder”, causing all indexed synsets to be erased.


16 The Tangle Browser is so called because it readily reveals a tangled web of connections amongst synsets, and from synsets
to other WordNet objects such as lexicographer files and usage codes. The GUI does not have a sophisticated placement
algorithm for the objects it displays; instead it allows a user to drag the objects around to taste.

The synset ensembles have the same icons, with similar effects, though their headstone is larger. They
have further icon groups too. These are concerned with connections between
the Tangle Browser and other facilities of the tool. Briefly, there are connections to the Tree Browser
(blank yellow), to the Scanner (target), and to the Annotator (addition symbol). There are also
connections to minor facilities that will not be elaborated upon further in this paper: a rather
rudimentary parser for the definitions and examples of glosses, and a way of maintaining such lists of
synsets as one thinks appropriate.

3. The Tree Browser 

The Tree Browser encourages one to visualise synsets as being arranged within a hierarchy. It works
with synsets of one part of speech at a time. Noun synsets are mostly connected to other noun synsets
using the hypernym relation and/or its converse, the hyponym relation. Similarly for verbs, though
whilst the relationship in that case is strictly one of troponymy, as Fellbaum (1998b) explains, it is not
differentiated from hyponymy within WordNet. Adjectives too can be handled by the Tree Browser,
using the “similar-to” relationship¹⁷. There is no meaningful way in WordNet to view adverbs as
forming a hierarchy however, so the Tree Browser is inapplicable to them.

Many noun synsets, and likewise many verb synsets, have a single hypernym: a “parent” node. Some
however have no hypernym, and some have several. The Tree Browser displays a single tree of
synsets, using two graphical devices to overcome the fact that the links between synsets do not
actually form one tree but rather a forest of trees. First, an artificial “orphanage” synset is provided as
the parent for synsets that have none, thus grouping all trees in the forest into one giant tree. Second, a
contrasting colour is used to draw attention discreetly to the relatively few synsets with multiple
hypernyms, whilst otherwise ignoring all their hypernyms beyond the first for purposes of tree
display.

Each synset is the head of a subtree (degenerate in the case of leaf node synsets). The number of
nodes in the subtree is referred to as its “dynasty size”. When a synset is displayed in the Tree
Browser, other numerical data is displayed with it: the number of its immediate hypernyms, its
dynasty size, and the depth of its deepest subtree.
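
The orphanage device and the per-synset figures can be sketched as follows. This is a minimal sketch rather than the tool's Lisp code: the names are hypothetical, the toy hierarchy is invented, and two conventions are assumptions, namely that the dynasty size counts the head synset itself and that a leaf has depth 0.

```python
ORPHANAGE = "<orphanage>"  # hypothetical name for the artificial root


def build_tree(hypernyms):
    """Build parent -> children lists from synset -> hypernym lists.
    Only the first hypernym is used; parentless synsets go under ORPHANAGE."""
    children = {}
    for synset, parents in hypernyms.items():
        parent = parents[0] if parents else ORPHANAGE
        children.setdefault(parent, []).append(synset)
    return children


def subtree_stats(children, node):
    """Return (immediate hyponyms, dynasty size, depth) for a node."""
    kids = children.get(node, [])
    size, depth = 1, 0
    for kid in kids:
        _, kid_size, kid_depth = subtree_stats(children, kid)
        size += kid_size
        depth = max(depth, kid_depth + 1)
    return len(kids), size, depth


# Hypothetical toy verb hierarchy: synset -> list of hypernyms.
hypernyms = {
    "move": [],
    "be": [],
    "disarrange": ["move"],
    "tangle": ["disarrange"],
    "ruffle": ["disarrange"],
}
children = build_tree(hypernyms)
print(subtree_stats(children, "move"))
print(subtree_stats(children, ORPHANAGE))
```

Ignoring every hypernym after the first is what turns the forest into a single tree here; the skipped links would have to be counted separately if they are to be highlighted.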

The trees of synsets are far too large to be usefully viewed as wholes: there are 12754 verb synsets 
and 74488 noun synsets. Selectivity is essential. There are various ways in which portions of these 
over-large trees may be selected for viewing. 

• Limits may be put on dynasty sizes. Only those synsets in the orphanage whose dynasty sizes are 
within limits will be displayed; then any of their immediate hyponyms with dynasty sizes within 
limits will also be shown, and so on. 

• Options in a menu attached to each synset already on display allow its subtree to be expanded 
either in its entirety or according to newly specified limits on dynasty size. The display is then 
redrawn with additional synsets. 

• Other such menu options allow all immediate hyponyms to be added to the display, or for one or 
more subtrees to be removed from it. 

• Connections with other components of the tool, notably the Tangle Browser and the Scanner,
allow particular synsets to be added to the Tree Browser display, together with sufficient
hypernyms to connect back to the root, singly or en masse.
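
The first selection rule in the list above, limits on dynasty size applied level by level from the orphanage, can be sketched like this. The function names and the toy tree are hypothetical, and the dynasty size is assumed to include the head synset.

```python
def dynasty_sizes(children, root):
    """Compute the dynasty size (node plus all descendants) of every node."""
    sizes = {}

    def walk(node):
        sizes[node] = 1 + sum(walk(kid) for kid in children.get(node, []))
        return sizes[node]

    walk(root)
    return sizes


def select_by_dynasty(children, sizes, root, low, high):
    """Show the root, then repeatedly show those hyponyms of shown nodes
    whose dynasty size lies within [low, high]."""
    shown = [root]      # the artificial root is exempt from the limits
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for kid in children.get(node, []):
            if low <= sizes[kid] <= high:
                shown.append(kid)
                frontier.append(kid)
    return shown


# Hypothetical toy tree: parent -> children.
children = {"root": ["a", "b"], "a": ["a1", "a2"], "a1": ["a11"]}
sizes = dynasty_sizes(children, "root")
shown = select_by_dynasty(children, sizes, "root", low=2, high=4)
print(sizes, shown)
```

A synset outside the limits hides its whole subtree in this sketch, since expansion only continues below synsets that are themselves shown.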


17 An adjective with similar-to links to many other adjectives can be treated rather as hypernyms are; and an adjective similar
to just one other can be treated as a hyponym. (Where two are similar only to each other, one is arbitrarily picked as parent).
The subtrees formed this way are one-level trees.


Figure 2 shows the Tree Browser in action, showing verbs in WordNet v1.7.

• Limits of 400 to 1000 had been placed on dynasty sizes. This resulted in the display of the
artificial “orphanage” synset (which is exempt from restrictions), and six genuine synsets. Each
one has 4 numeric fields: the number of hypernyms not shown (always zero and never highlighted
in this example), then the number of immediate hyponyms, then the dynasty size (six being in the
400-1000 range), and finally the depth of the subtree.

• One of the synsets for the verb “tangle” was added through use of the Tangle Browser, together
with its hypernym “disarrange”, which links it back to the portion of tree already on display.

• The “Show Children” menu option for the “disarrange” synset caused the remainder of its 3
hyponyms to be added to the display.

• The menu of options for one synset has been summoned by clicking on a panel.



Figure 2: Screenshot of the Tree Browser (verbose mode). 

The Tree Browser can also be used to display the hierarchical connectivity of synsets while giving no
textual information about them at all. Figure 3 shows its appearance when used in that terse mode.
Quite large numbers of synsets can be simultaneously displayed this way but the words of the synsets
are not shown. This mode of display would not be of any use except for two reasons: first, the Tree
Browser can interact with other components of the tool, so that for example an individual synset can
be picked and revealed in its full glory in the Tangle Browser; and second, colour coding can be used
to convey a little more information.



The menu attached to any displayed synset provides two ways of removing synsets from the display
(“Delete Node”, “Delete Siblings”) and two ways of adding synsets (“Show Children”, “Expand
Subtree”). The menu also provides a way to choose good limit values when expanding a subtree,
through the “Histogram” option. Figure 4 shows the histogram produced for the synset containing the
verbs {act 0; move 0} found at top left of figure 2. That synset has 51 hyponyms, and a dynasty of
542 other synsets. How are those 542 spread among the 51 hyponyms? The histogram, crude though it
is, provides this information. One can see from it that one hyponym, corresponding to the blue dot at
far right of figure 4, has its own dynasty of around 110 synsets: this is because red markers indicate
increments of 100. One can also estimate that five other hyponyms have dynasties between 25 and
100. With this histogram, one could set limits for expanding the subtree and know what sort of result
to expect.

The remaining menu options connect the Tree Browser to other tool components. “Explore Links”
causes the synset to be shown in the Tangle Browser; “Sensitise” is concerned with the Scanner
component, to which we turn next; and “Extra-self” and “Extra-subtree” invoke the Annotator on
one synset or many, respectively.


4. The Scanner 


At its simplest, the Scanner allows one to interrogate a collection of synsets, finding out which ones 
have particular combinations of properties. These may then be displayed in the Tangle Browser or the 
Tree Browser or both; they may be subjected to further analysis with the Scanner; they may be passed 
to the Annotator; or they may be stored in a list for future reference. 

The collection of synsets which the Scanner considers is usually set from within the Tree Browser, 
using its “ Sensitise ” menu option on some synset. This causes the synset, and often its whole 
associated dynasty, to become sensitive to scanning. Finer control can be exercised however, by (a) 
desensitising portions of a subtree, and (b) sensitising synsets in a subtree that have dynasties within 
certain limits, similar to but independent of the limits used in determining which synsets to display. 
This has been particularly useful for examining the large number of very small independent subtrees 
of verbs, many with no hypernyms and no hyponyms. Other ways of sensitising groups of synsets
also exist: for example all synsets can be sensitised that mention verbs listed by Levin (1993) in any 
one of her verb classes. 


Figures 5, 6, 7, and 8 show the Scanner in operation, with the entire subtree of the “move 1; displace
0” synset sensitised. The properties of which the Scanner is aware are shown in figure 5. Any number
of properties can be chosen. The properties include some, such as Aktionsart and Comments, that
relate to information that might have been recorded using the Annotator; but most relate to
information discernible directly from WordNet. Figure 5 shows some properties selected, namely
“Manyparents”, which classifies synsets according to their numbers of hypernyms; “Lfnum”, which
classifies synsets according to their lexicographer file; and “Has end prep”, which classifies synsets
according to the preposition following the final underscore of any multiword. Figure 6 shows some of
the different classifications that sensitised synsets fell into: for this example numerous smallish
classes are not shown. Each classification has a code number, and the number of synsets in the
classification is given along with a description of the classification. Note that the classes are not
necessarily disjoint: while a synset has one and only one Lexicographer File Number, it may contain
several multiwords and these may have different prepositions.


□ Ambiguity
☑ Manyparents
□ Non-uniform frames
□ Usages
☑ LFnum
□ Non-hierarchic
□ Glosstype
□ Ampersand Count
□ Non-uniform pointers
□ Has end pronoun
□ Has mid pronoun
☑ Has end prep
□ Has mid prep
□ Hyphen count
□ Underscore count
□ LFNUM change
□ Comments
□ Aktionsart
□ Link to different POS


Figure 5: Available Scanner Properties



Figure 6: Classification Result 



Figure 7: Matrix showing numbers and proportions of bi-classified synsets


The Scanner then displays a table showing the numbers of synsets in pairs of classes.

Figure 7 shows the table for the current example. The positions of the rows and the columns both
correspond to the code numbers of figure 6. Below the diagonal line the number of synsets falling
into both classes is shown; above the line is shown the percentage of the synsets in the smaller class
which are also in the larger class. The two highlighted positions show that 9 synsets were both in class
3 (LFNUM 35) and class 9 (endprep in), and that these constituted 52% of the smaller of those two
classes.
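
The two quantities shown in the matrix can be computed as sketched below. This is an illustration with hypothetical names and toy class memberships; it does not reproduce the figures of the example in the text.

```python
def pair_matrix(classes):
    """For every pair of class codes i < j return (count, percent), where
    count is the number of synsets in both classes and percent is that
    count as a share of the smaller of the two classes."""
    codes = sorted(classes)
    result = {}
    for index, i in enumerate(codes):
        for j in codes[index + 1:]:
            both = len(classes[i] & classes[j])
            smaller = min(len(classes[i]), len(classes[j]))
            percent = 100.0 * both / smaller if smaller else 0.0
            result[(i, j)] = (both, percent)
    return result


# Hypothetical toy classes: code number -> set of synset identifiers.
classes = {
    3: {"s1", "s2", "s3", "s4", "s5"},
    9: {"s4", "s5", "s6", "s7"},
}
matrix = pair_matrix(classes)
print(matrix)
```

Taking the percentage relative to the smaller class makes the value symmetric in the two classes, so one triangle of the matrix can hold counts and the other percentages.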

The Scanner then presents a dialog box which mediates further interrogation and analysis of the
sensitised synsets. Figure 8 shows this dialog box in use. With the classification combiner (CC) one
may compose arbitrary Boolean queries (using AND, OR and NOT in a Lisp-like parenthesised
expression) to identify and count the synsets with particular properties. The figure shows a query
seeking those synsets with end preposition “in” that are not in lexicographer file 35, that is, the
missing 48% from those highlighted in figure 7. An informational area describes the portion of
WordNet that has been sensitised, clarifies the query by replacing numeric classification codes with
descriptive expressions, and tells how many synsets are “hit”, that is, are in the result set.
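
A query of this kind can be evaluated with a small recursive function. The sketch below is in Python rather than the tool's Lisp, represents the parenthesised expression as nested tuples, and uses hypothetical class contents; it is not the classification combiner itself.

```python
def evaluate(query, classes, universe):
    """Evaluate a nested (OP, arg, ...) query over named classes and
    return the set of synsets in the universe that are hit."""
    if isinstance(query, str):
        return classes[query] & universe
    op, *args = query
    if op == "AND":
        result = set(universe)
        for arg in args:
            result &= evaluate(arg, classes, universe)
        return result
    if op == "OR":
        result = set()
        for arg in args:
            result |= evaluate(arg, classes, universe)
        return result
    if op == "NOT":
        (arg,) = args
        return universe - evaluate(arg, classes, universe)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical sensitised synsets and classifications.
universe = {"s1", "s2", "s3", "s4", "s5", "s6"}
classes = {
    "ENDPREP in": {"s1", "s2", "s3"},
    "LFNUM 35": {"s2", "s3", "s4"},
}
query = ("AND", "ENDPREP in", ("NOT", "LFNUM 35"))
hits = evaluate(query, classes, universe)
print(len(hits), sorted(hits))
```

Passing the set of hits back in as `universe` gives a simple way to refine a result with a further query.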



Figure 8: Scanner analysis dialog. The informational area reports the query
(AND "ENDPREP in" (NOT "LFNUM 35")) -> 8 synsets; the controls include passing the result to the
Tangle Browser, showing the whole result in the hierarchy, storing the result in a persistent list,
showing dynasties' precision and recall, and starting a new refinement.

Numerous things can be done with a query result.

• Some or all of its synsets can be shown in the Tangle Browser, where target icons become
marked with an arrow to distinguish synsets on display which are hit.

• The synsets that are hit can be shown in the Tree Browser, where bold italic type distinguishes
synsets that are hit. Where the number of synsets hit is large, the terse mode of Tree
Browser display can become very useful; colour coding then distinguishes synsets that are hit.

• The hit synsets can be stored in a new list, or added to or removed from an existing one.


• The synsets that are hit can be passed en masse to the Annotator, which processes them one by
one but allows premature exit, useful if large numbers are queued.

• Each subtree of synsets (or dynasty) can be thought of as a collection of items that individually do
or do not fulfil the conditions of a query. The precision of the subtree (the proportion of its
synsets that are hit) and the recall of the subtree (the proportion of all hit synsets that lie in the
subtree) can be computed; the results are displayed as a pair of partially filled circles
(“clockfaces”) in any synset displays in the Tree Browser.

• If precision and recall values for subtrees have been computed, “good clusters” may be
distinguished from the rest. One may wish to build upon WordNet, providing information which
(unless overridden) will be inherited from mother synset to daughter along hypernym links, in the
manner of Gomez (1998). It is useful for such an enterprise to distinguish subtrees where this can
be done with worthwhile accuracy (precision) and payoff (recall). “Good clusters” are those
which maximise the product of precision and recall. Those whose product exceeds a threshold
(shown as thousandths in the dialog), or failing that, the best few, can be picked out: their
clockfaces are shown in red, others are in blue. These parameters may be adjusted from default
values, and goodness reassigned.

• Good cluster synsets can be passed selectively for Tree Browser display.

• The Scanner can be invoked in a quasi-recursive fashion, treating the result of a CC query as the
universe of relevant synsets. The numbers of synsets in existing classifications will be recomputed
and displayed, as will a new matrix of bi-classifications, and a dialog box created for further
analysis of this refinement of the sensitised synsets.
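
The precision, recall and “good cluster” computations from the list above can be sketched as follows. The names, the toy subtrees and the default for “the best few” are assumptions; the threshold is expressed in thousandths, as in the dialog.

```python
def cluster_scores(subtrees, hits):
    """Return root -> (precision, recall, product) for each subtree.
    precision: share of the subtree's synsets that are hit;
    recall: share of all hit synsets that lie in the subtree."""
    scores = {}
    for root, members in subtrees.items():
        in_both = len(members & hits)
        precision = in_both / len(members) if members else 0.0
        recall = in_both / len(hits) if hits else 0.0
        scores[root] = (precision, recall, precision * recall)
    return scores


def good_clusters(scores, threshold_thousandths, best_few=3):
    """Pick subtrees whose precision * recall exceeds the threshold,
    or, failing that, the best few by product."""
    good = sorted(root for root, (_, _, product) in scores.items()
                  if product * 1000 > threshold_thousandths)
    if good:
        return good
    ranked = sorted(scores, key=lambda root: scores[root][2], reverse=True)
    return ranked[:best_few]


# Hypothetical toy data: subtree root -> member synsets, plus the hits.
hits = {"a", "b", "c", "d"}
subtrees = {"X": {"a", "b", "c", "e"}, "Y": {"d", "f", "g", "h", "i"}}
scores = cluster_scores(subtrees, hits)
print(scores)
print(good_clusters(scores, threshold_thousandths=300))
```

Using the product means a subtree scores well only when it is both mostly hit and holds a large share of the hits, which matches the stated need for worthwhile accuracy and payoff together.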

5. The Annotator 

The Annotator allows observations and additional information to be recorded about an existing synset. 
This information is stored in a file, jointly keyed by the synset’s part of speech and its 8-digit code¹⁸,
where it may subsequently be found by both the Scanner and the Annotator itself. The Annotator 
displays fixed WordNet information about a synset: its POS, offset, words, gloss, and verb usage 
codes. It also provides a link to the Tangle Browser, where all other WordNet information may be 
viewed and links may be browsed. The Annotator then shows any existing comments on the synset, 
and provides a panel for input of any new comment. Figure 9 shows the Annotator in action on one of 
the synsets hit by the Scanner CC query used in the example of §4. 
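
The joint keying of annotations can be sketched with a dictionary. This is a minimal sketch with hypothetical names and a made-up offset; the tool's actual file format is not described in the text.

```python
def annotation_key(pos, offset):
    """Build a key from the part of speech and a zero-padded 8-digit code."""
    return f"{pos}:{offset:08d}"


def add_comment(store, pos, offset, comment):
    """Append a comment to the list kept for one synset."""
    store.setdefault(annotation_key(pos, offset), []).append(comment)
    return store


store = {}
add_comment(store, "v", 1234567, "possible spelling error in gloss")
add_comment(store, "v", 1234567, "Aktionsart: undecided")
print(store)
```

Keys of this kind are tied to one WordNet version, since the 8-digit codes change between releases.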

There are a few types of comment that have been made so often that they have been given the exalted 
status of “observations”: for example, notes about spelling errors. Checkboxes allow these to be 
recorded easily and in a standard way that the Scanner could in principle detect.

Checkboxes also allow for recording judgements about the Aktionsart, or aspectual classification, of 
verbs - in an idiosyncratic and unusually rich classification that the author believes appropriate. The 
individual words of a synset can also be rated according to their familiarity and sureness. These again 
are idiosyncratic notions of the author. Familiarity is meant to capture the notion that a normal native 
speaker gets to know this sense of a word very young, or in adolescence, or perhaps not at all in the 
case of very specialised words. Sureness is meant to capture the notion that a normal adult native 
speaker, whilst knowing a sense of a word well enough to understand its use, might nevertheless be 
insufficiently clear about it to define it and insufficiently confident of their understanding to use it 
themselves. Other would-be annotators of WordNet would be very likely to require support for 
different annotations than these. 

In the case of multiwords (or compound words) in a synset, there is provision for recording an 
analysis of their constituents. In the example, the multiword “put_in” could be annotated with the 
analysis “V Particle”. 


18 This is not a wholly satisfactory arrangement since it hinders the carry-over of information to upgraded versions of 
WordNet. It is planned that the Annotator shall use sense keys instead. 






[Screenshot of the Annotator dialog for the verb synset containing "put_in" and "submit": a "New comment?" input field; Aktionsart checkboxes labelled State (continuing), State (beginning), State (ending), Activity (Unbounded Process), Accomplishment-end (End of Bounded Process), Accomplishment-start (Start of Bounded Process), Achievement (Point occurrence) and Undecided (come back later); and Familiarity and Sureness ratings, from highest to lowest, for each word.]

Figure 9: The Annotator in use on one synset in a Scanner result set. 


The Annotator does not currently provide for changing [...]. There is no reason, however, why [...] 
not be compatible with standard distribution WordNet software. 

6. Conclusions 

[...] The integration of these [...] WordNet itself provides. Working together, however, the four 
components allow systematic testing of hypotheses, and systematic annotation of synsets with 
particular combinations of properties. 

It would be interesting to adapt the tool to work with WordNets other than those from Princeton. For 
monolingual WordNets that have the same organisation as Princeton’s (that is, they have part-of- 
speech pairs of index files and data files whose individual lines follow the same syntactic rules), such 
adaptation should be quite straightforward. For multilingual WordNets, having additional structures 
such as the ILI (Inter-Lingual-Index) of EuroWordNet (2001), a substantial adaptation effort would be 
required. Some basic design issues would have to be reassessed, since users of such a tool would 
presumably wish sometimes to work with different language WordNets simultaneously. The tool as it 
stands now is able to work with multiple versions of Princeton’s WordNet simultaneously, which 
could form the basis of an extension to multilingual WordNets. However, each version is treated in 
isolation from the others, which would not be appropriate for a multilingual WordNet. 

Future work on the browsing, searching and annotation tool is expected to be of four kinds. First, 
there are dozens of little inconsistencies, for instance in the use of colour coding and in the provision 
of help, which ought to be remedied. Second, there are customisations which users of the software 
might require, for instance to annotate WordNet with different kinds of information, or to scan for 
different properties. A mechanism ought to be provided to allow such customisation without the need 
to understand and modify the program code. Third, the feasibility of an extension to multilingual 
WordNets should be more thoroughly evaluated. Fourth, arguably, in order to attract a wider audience 
the tool should be ported to a more commonly used environment than that in which it currently runs. 

REFERENCES 

Digitool. (2001) http://www.digitool.com 

EuroWordNet. (2001) http://www.hum.uva.nl/~ewn/ 

C. Fellbaum. (1998) (ed.) WordNet: An Electronic Lexical Database. MIT Press. 

C. Fellbaum. (1998b) A Semantic Network of English Verbs. In Fellbaum (1998) pp69-104. 

B. Levin. (1993) English Verb Classes and Alternations. University of Chicago Press. 

F. Gomez. (1998) Linking WordNet Verb Classes to Semantic Interpretation. In S. Harabagiu (ed.) 
Proceedings of Workshop on Usage of WordNet in Natural Language Processing Systems, pp58-64, 
COLING-ACL. 

WordNet. (2001) http://www.cogsci.princeton.edu/~wn/links/ 

Arthur Cater, Dept of Computer Science, University College Dublin, Ireland, arthur.cater@ucd.ie 





Adapting GermaNet for the Web 

Claudia Kunze 
Lothar Lemnitzer 

Abstract 

This paper deals with the adaptation of the lexical-semantic wordnet GermaNet for web-based 
applications. The GermaNet data have been converted into XML-conformant documents, which 
represent the concepts and all the basic relations defined between them, while accounting for the 
peculiarities of the German wordnet. We also compare GermaNet to the Princeton WordNet in order 
to unify the diverging representations for use within a polylingual framework. 

Introduction 

In recent approaches to natural language processing, wordnets, which are structured along the same 
lines as the Princeton WordNet (see Miller et al. (1990), Fellbaum (1998)) 19 , have become very 
popular resources. These wordnets model the common and frequent words of a language and encode 
the basic semantic relations established between them, like synonymy, hyponymy, antonymy, and 
meronymy. 

Wordnets can support word sense disambiguation, which is crucial for information retrieval, semantic 
annotation, etc., and are also valuable resources for linguistic research. 

Due to the increasing interest in WordNet as a pioneer resource, various language-specific and 
interlingual initiatives for wordnet construction have been launched (GermaNet 20 , French WordNet, 
EuroWordNet 21 , BalkaNet etc.). 

As web-based applications become increasingly important in the field of language engineering, it is 
worth making wordnets accessible to XML-based tools and services, in order to exploit the rich 
semantic structures encoded in them. The original WordNet has been constructed as a database, on 
which various tools have been built. EuroWordNet followed this strategy. A more recent version of 
WordNet represented as PROLOG clauses has been developed, which is therefore accessible to 
PROLOG-based NLP tools. It seems however quite reasonable to derive an XML representation 
format for wordnets. With XML being a quite powerful standard, more tools and applications will 
build their interfaces on this standard. For example, Chris Manning's tool for browsing and visual 
exploration of a structured Warlpiri dictionary (Manning et al. 2001) might be adapted. There is also a 
EuroWordNet-specific visualisation tool (VisDic, http://nlp.fi.muni.cz/projekty/mt/visdic/). To enable 
polylingual applications, it is useful to construct a common format for the various wordnets. For the 
German wordnet, GermaNet, XML conversion has been carried out at the Linguistics Department of 
the University of Tübingen. 

In this paper we describe how the XML version of GermaNet has been developed. The first section 
outlines the German wordnet and focuses on the major differences between GermaNet and WordNet. 


19 Princeton WordNet (Miller (1990), Fellbaum (1998)) as the first in the field of wordnet construction has evolved as the 
quasi-standard due to various aspects: it models English, has a broad coverage, and is freely available for the research 
community (http://www.cogsci.princeton.edu/~wn). 

20 GermaNet has been developed within the project "SLD: Ressourcen und Methoden zur semantisch-lexikalischen 
Disambiguierung", which was funded by the Ministry of Research of Baden-Württemberg in 1996-1997. A second period 
of funding has been granted in 1999-2001. The database was built by Helmut Feldweg (first period's project leader), Claudia 
Kunze, Andreas Wagner, Karin Naumann, Birgit Hamp, Michael Hipp, Valerie Béchet-Tsamou, Susanne Schüle, Christine 
Thielen and Rosemary Stegmann (http://www.sfs.uni-tuebingen.de/lsd). 

21 The EuroWordNet database has been built within two projects: EuroWordNet-1 (LE-4003) and EuroWordNet-2 (LE-4 
8328), funded by the European Commission. The whole project was coordinated by Piek Vossen 
(http://www.hum.uva.nl/~ewn/). 



The conversion process is briefly described in the second section. In section 3, we present the 
respective DTDs (Document Type Definitions). In the fourth section, compatibility issues across the 
English and German wordnets are raised in some detail. A few examples are given in the last section. 
The conclusion summarises future perspectives with regard to wordnet applications and standards. 

1. Differences between GermaNet and WordNet 

1.1. Some remarks on GermaNet 

The fundamental lack of electronic lexical-semantic resources for German (see Hamp & Feldweg 
(1997)) was the major motivation for constructing GermaNet a few years ago. Therefore, a first 
project (SLD) created an on-line thesaurus covering the German basic vocabulary. GermaNet adopted 
the design principles and the database technology from the Princeton WordNet. However, the German 
wordnet is not a mere translation of WordNet, but was built from scratch, taking into account different 
lexicographic resources like DUDEN 8 and DEUTSCHER WORTSCHATZ as well as corpus 
evidence. GermaNet includes principle-based modifications on the constructional and content- 
oriented level. 

GermaNet currently covers some 40,000 synsets with more than 60,000 word meanings, modelling 
nouns, verbs, adjectives and adverbs (see Kunze (2001)). Within the EuroWordNet project, GermaNet 
was integrated into the polylingual EuroWordNet database (see Vossen (1999), Wagner and Kunze 
(1999)). We followed the merge approach, i.e., a wordnet is built independently from WordNet and 
the synsets are linked to the Interlingual Index (ILI) by creating the appropriate relations. The merge 
approach preserves language-specific patterns with differing hierarchical structures in comparison to 
WordNet 22 . 

The construction of GermaNet as well as the integration of GermaNet into EuroWordNet have been 
performed following independent principles, thus leading to differences with respect to the Princeton 
WordNet. 

1.2. Major differences to WordNet 

In spite of its general similarity with and compatibility to WordNet, we can state the following 
differences for GermaNet: 

• we followed linguistic design principles as opposed to Princeton's psychological motivation; 

• we are using artificial, i.e. non-lexicalised concepts, which have been introduced to balance the 
taxonomical structure more adequately; 

• in GermaNet, adjectives are ordered hierarchically as opposed to Princeton's grouping by the 
satellite approach; 

• we pursued a uniform treatment of meronymy within GermaNet, whereas WordNet has 
established three different pointers for Part, Member and Substance; 

• within GermaNet, the causation relation can be encoded between all parts of speech, not only 
between verbs and adjectives; 

• due to emphasising the syntax-semantics-interface for disambiguation tasks we accounted for 
some sixty verbal subcategorisation frames. These frames are more elaborate than the WordNet 
frames, and, furthermore, for each verb reading we provide a typical example. 

These differences and their technical impact on compatibility for the XML conversion are outlined in 
more detail below. 


22 In contrast, some language-specific wordnets were integrated via the expand approach, in which WordNet synsets were 
translated into the language in question. Consequently, the relational structure was adopted and therefore highly biased by 
WordNet. 





2. Converting GermaNet to XML 

The GermaNet lexicographer’s files constitute the starting point of the conversion process 23 . 

The Entity-Relationship graph below visualises the structure of GermaNet and guides the conversion 
process (see fig. 1). 

From this graph we can easily deduce: 

1. the objects, which are synsets and lexical units 24 , represented as rectangles, 

2. the attributes of these objects, represented as circles, 

3. the relations, represented as diamonds: 

a) conceptual relations which hold between instances of the synset object and 

b) lexical-semantic relations which hold between instances of the lexical unit object. 

The conversion programmes are written in Perl. There are two main programmes: 

1. one transforms the strings in the lexicographer’s files into elements and their attributes, while 
checking for consistency. The transformation is structure-preserving. 

2. the other checks the input files for conformancy, attaches unique IDs to all object instances and 
separates the object information from the relation information. From one input file, it generates 
two output files: one file containing the GermaNet objects, the other file containing the GermaNet 
relations. Furthermore, an index of all synsets and an index of all lexical units with their IDs are 
created. 
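
The second step (attaching IDs, separating objects from relations, and building the two indexes) can be sketched as below. The original programmes are written in Perl and are not reproduced in the paper, so this Python sketch is only an illustration under assumptions: the input is taken to be already parsed into dictionaries, and the function name, the dictionary keys and the zero-based numbering are hypothetical. The ID shapes (`prefix.number` for synsets and `prefix.number.word` for lexical units) follow the identifiers visible in the example of section 5.

```python
def split_objects_and_relations(file_prefix, parsed_synsets):
    """Attach IDs and separate object information from relation information.

    parsed_synsets: list of dicts with keys 'words' (list of str) and
    'relations' (list of (relation_name, target_position) pairs), where
    target_position is the position of another synset in the same list.
    Returns (objects, relations, synset_index, lexunit_index).
    """
    objects, relations = [], []
    synset_index, lexunit_index = {}, {}
    # Assign all synset IDs first so that relations can point forwards.
    ids = [f"{file_prefix}.{n}" for n in range(len(parsed_synsets))]
    for n, synset in enumerate(parsed_synsets):
        sid = ids[n]
        units = []
        for word in synset["words"]:
            uid = f"{sid}.{word}"
            units.append({"id": uid, "orthForm": word})
            lexunit_index.setdefault(word, []).append(uid)
        objects.append({"id": sid, "lexUnits": units})
        synset_index[sid] = list(synset["words"])
        # Relation information goes to a separate list (second output file).
        for name, target in synset["relations"]:
            relations.append({"type": name, "from": sid, "to": ids[target]})
    return objects, relations, synset_index, lexunit_index
```

Writing `objects` and `relations` to two separate documents then mirrors the two output files described above.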



Note: CR = conceptual relation; LSR = lexical-semantic relation; oV = orthographic variant 

Figure 1: Entity-Relationship graph of the GermaNet structure 

3. XML Document Types and Their Definitions 

We decided to treat objects and relations with their respective attributes separately. This leads to two 
separate types of documents which are described by two separate document type definitions. In one 
document type we deal with the objects (synsets and lexical units) and their attributes, in the other 
document type with the relations and their attributes. 

This separation has two benefits. First, the relations are easier to manage, and an application-specific 
processing semantics can be attached once there are web-based tools, browsers etc. which are able to 
handle XML links properly. Second, it will be straightforward to connect one wordnet to other wordnets: 
the linking mechanism will be just another type of relation, now linking synsets and/or lexical units of 
different languages. 

23 These files are the repositories in which the output of the lexicographers' work is stored. 

24 Following Cruse (1986), a lexical unit constitutes an independent form-meaning pair. 

In the following, we present the DTDs and discuss a few design decisions. The DTDs and the 
documents which are generated according to them conform to the XML 1.0 and the XLink 1.0 
recommendations. 

A. The Object DTD 


<!-- DTD for GermaNet objects (synsets, lexical units) --> 
<!-- version 1.4, July 2001 --> 
<!-- Copyright: Seminar f. Sprachwissenschaft, Universität Tübingen --> 
<!ELEMENT synsets (synset)*> 
<!ELEMENT synset ((lexUnit)+, frames?, paraphrases?, examples?)> 
<!ATTLIST synset id ID #REQUIRED 
                 wordClass CDATA #IMPLIED> 
<!ELEMENT lexUnit (orthForm)> 
<!ATTLIST lexUnit id ID #REQUIRED 
                  StilMarkierung (ja|nein) 'nein' 
                  sense CDATA #REQUIRED 
                  orthVar (ja|nein) #REQUIRED 
                  artificial (ja|nein) #REQUIRED 
                  Eigenname (ja|nein) #REQUIRED> 
<!ELEMENT orthForm (#PCDATA)> 
<!ELEMENT paraphrases (paraphrase)+> 
<!ELEMENT paraphrase (#PCDATA)> 
<!ELEMENT examples (example)+> 
<!ELEMENT example (text, frame*)> 
<!ELEMENT frames (frame)*> 
<!ELEMENT text (#PCDATA)> 
<!ELEMENT frame (#PCDATA)> 


Description 

The document contains a set of synsets. Every synset consists of at least one lexical unit. Paraphrases 
may be given to characterise the meaning of the synset and examples may be added to illustrate the 
use of its member lexical units. For verb synsets, subcategorisation frames are given. The individual 
lexical units are characterised by a set of attributes, e.g. sense number and stylistic marker 
(StilMarkierung). A concept can be represented by a string which does not correspond to a lexical unit 
in the German vocabulary. In this case the unit is marked as artificial. The content model of most 
atomic elements is set to #PCDATA, therefore minimising data type restrictions. It is up to the 
lexicographers to fill the elements with appropriate data. 
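
To make the content model concrete, the following sketch builds one synset element in the order the object DTD prescribes (lexical units first, then frames, then examples). It uses Python's standard `xml.etree.ElementTree` module rather than the Perl programmes of section 2; the function name `build_synset` and its parameters are assumptions made for illustration, paraphrases are left out, and every lexical unit is given sense "0" for simplicity. Attribute names are spelled as in the DTD listing.

```python
import xml.etree.ElementTree as ET

def build_synset(synset_id, word_class, units, frames=(), examples=()):
    """Build a <synset> element following the object DTD's content model:
    (lexUnit)+, frames?, paraphrases?, examples?  (paraphrases omitted here).

    units: list of (orth_form, stylistically_marked) pairs.
    examples: list of (text, frame) pairs.
    """
    if not units:
        raise ValueError("a synset needs at least one lexical unit")
    synset = ET.Element("synset", {"id": synset_id, "wordClass": word_class})
    for orth, marked in units:
        unit = ET.SubElement(synset, "lexUnit", {
            "id": f"{synset_id}.{orth}",
            "StilMarkierung": "ja" if marked else "nein",
            "sense": "0",
            "orthVar": "nein",
            "artificial": "nein",
            "Eigenname": "nein",
        })
        ET.SubElement(unit, "orthForm").text = orth
    if frames:
        group = ET.SubElement(synset, "frames")
        for frame in frames:
            ET.SubElement(group, "frame").text = frame
    if examples:
        group = ET.SubElement(synset, "examples")
        for text, frame in examples:
            example = ET.SubElement(group, "example")
            ET.SubElement(example, "text").text = text
            ET.SubElement(example, "frame").text = frame
    return synset
```

The result can be serialised with `ET.tostring(synset, encoding="unicode")`. ElementTree does not validate against a DTD, so conformance would still have to be checked with a validating parser.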


25 The full DTDs and their documentation are available on request. 





B. The Relation DTD 


<!-- DTD for GermaNet relations (conceptual and lexical-semantic) --> 
<!-- version 1.0, July 2001 --> 
<!-- Copyright: Seminar f. Sprachwissenschaft, Universität Tübingen --> 
<!ELEMENT relations (hyperonym | holonym | see | entails | causes | meronym | hyponym | antonym | 
                     pertonym | participleOf | derivedFrom)*> 
<!ELEMENT hyperonym (#PCDATA|locator|arc)*> 
<!ATTLIST hyperonym xmlns:xlink CDATA #FIXED 'http://www.w3.org/1999/xlink' 
                    xlink:type (extended) #FIXED 'extended' 
                    sense CDATA #REQUIRED> 

.... (the same structure for all other relations) 

<!ELEMENT locator EMPTY> 
<!ATTLIST locator xlink:type (locator) #FIXED 'locator' 
                  xlink:href CDATA #REQUIRED 
                  xlink:label CDATA #REQUIRED> 
<!ELEMENT arc EMPTY> 
<!ATTLIST arc xlink:type (arc) #FIXED 'arc' 
              xlink:from CDATA #REQUIRED 
              xlink:to CDATA #REQUIRED 
              xlink:actuate (onRequest) #FIXED 'onRequest' 
              xlink:show (other) #FIXED 'other'> 


Description 

The ‘GermaNet relations’ DTD models different types of relations (hyperonymy, antonymy etc.) 
which hold either between synsets or between lexical units. A link consists of two nodes (locators, 
specified through the IDs of the synsets or lexical units) and an arc. The attributes of the ‘arc’ element 
specify the processing behaviour whenever a link is traversed. Of course, appropriate software 
(browser, visualiser or else) is needed to implement these features. However, with XLink having 
reached the status of a W3C recommendation, its integration into standard software will be realized in 
the near future. 
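
The link structure (two locators and one arc inside an extended link) can be illustrated with a short sketch using Python's standard `xml.etree.ElementTree` module. The helper names `xl` and `build_link` and the parameter `document` are hypothetical and chosen for this illustration only; the attribute names and fixed values are those of the relation DTD, and the sense value "0" is a simplification.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)  # serialise the namespace as 'xlink:'

def xl(name):
    """Qualify an attribute name with the XLink namespace."""
    return f"{{{XLINK}}}{name}"

def build_link(relation, source_id, target_id, document):
    """Build one extended link: two locators plus an arc from source to target."""
    link = ET.Element(relation, {xl("type"): "extended", "sense": "0"})
    for node_id in (target_id, source_id):
        ET.SubElement(link, "locator", {
            xl("type"): "locator",
            xl("href"): f"{document}#{node_id}",
            xl("label"): node_id,
        })
    ET.SubElement(link, "arc", {
        xl("type"): "arc",
        xl("from"): source_id,
        xl("to"): target_id,
        xl("actuate"): "onRequest",
        xl("show"): "other",
    })
    return link
```

The arc refers to the locators through their labels, not through their hrefs, which is why each locator carries both attributes.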

4. Compatibility Issues 

A common access interface to the wordnets of all languages is highly desirable. NLP programmes 
which incorporate one or several wordnet(s) should rely on a uniform access method. Thus, it follows 
that all wordnets should be structurally unified, but without sacrificing the language-specific 
extensions and changes to the original structure of the Princeton WordNet. It is therefore necessary to 
consider compatibility between all wordnets. In this section, we will compare GermaNet to the 
Princeton WordNet, which is the original and best documented resource, raise several compatibility 
issues and show how they can be solved within the XML framework 26 . We distinguish six types of 
structural differences between WordNet and GermaNet: 

1. Objects which are obviously of the same type bear different names. This can be easily solved by a 
mapping of these different names. The same holds for the names of relations and attributes. 

2. Objects or relations have different extensions in both nets, as is the case with the CAUSE relation. 
In WordNet, this relation holds exclusively between verbs and adjectives. In GermaNet, synsets 
of all word classes are in the domain of this relation. True compatibility would require a finer 
granularity of the CAUSE relation in GermaNet. This could be realised by adding an attribute to 
it. The values of this attribute would lead to at least two subsets of items: one which is 
extensionally identical with the original CAUSE relation and one which characterises the 
GermaNet-specific extension. 

26 Alternatively, the EuroWordNet specification could serve as starting point for a uniform XML representation across 
languages. We decided, however, to focus on the mapping between GermaNet and WordNet in order to capture all 
individual properties of GermaNet and its full coverage of data. 

3. The granularity of a relation differs. For example, WordNet divides the generic part-whole 
relation into three sub-relations: part (e.g. arm, body), member (e.g. director, staff), substance (e.g. 
glass, glass plate). Other values might be added to this list. GermaNet, in contrast, uniformly 
applies the generic relation. We recommend that WordNet add an attribute to a truly generic 
part-whole relation which divides the instances into three classes. In GermaNet, this attribute 
might get the value ANY, until a more fine-grained specification is implemented. 

4. There are a few attributes specific to GermaNet, e.g. StilMarkierung (=stylistic marker) as an 
attribute of lexical units. For instance, the German concept schlafen (=sleep) has ratzen *s, 
pennen*s, knacken*s, pofen*s as hyponyms which are stylistically marked. These attributes can 
be INCLUDED in GermaNet and EXCLUDED elsewhere. 

5. An attribute which is equivalent in both wordnets specifies a different set of values. This holds for 
the verb frame attribute. The German verb frames which are implemented in GermaNet are a 
closed class. For type checking, it would have been more elegant to define an attribute with a fixed 
set of values. For compatibility reasons, however, we opted for an element group "frames" with 
frames as its elements and #PCDATA as data type. We delegate the type checking to other 
routines. 

6. The adjective domain in GermaNet differs fundamentally from that in WordNet. The domain is 
ordered hierarchically in GermaNet, whereas WordNet applies an associative similarity relation 
which groups adjectives in equivalence classes. At present, we do not see any easy solution which 
would preserve compatibility in this case. 
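
Points 1 to 3 of this list can be made concrete with a small sketch. It is an illustration under assumptions and not part of the GermaNet or WordNet software: the function name `unify_relation`, the name mapping, the result keys `extension` and `partType`, and the reading of "between verbs and adjectives" as "both ends are verbs or adjectives" are all hypothetical choices made here.

```python
# Point 1: map differing names onto one shared vocabulary (illustrative values).
NAME_MAP = {"hyperonym": "hypernym", "causes": "cause"}
# Point 2: word classes assumed to be covered by the original CAUSE relation.
WORDNET_CAUSE_CLASSES = {"verb", "adjective"}
# Point 3: the three sub-relations of the part-whole relation.
PART_TYPES = {"part", "member", "substance"}

def unify_relation(name, net, word_classes=(), part_type=None):
    """Return a unified relation record with compatibility attributes."""
    unified = {"name": NAME_MAP.get(name, name), "net": net}
    if unified["name"] == "cause":
        original = all(c in WORDNET_CAUSE_CLASSES for c in word_classes)
        unified["extension"] = "original" if original else "GermaNet-specific"
    elif unified["name"] == "meronym":
        if net == "GermaNet":
            # No fine-grained specification yet, so use the generic value.
            unified["partType"] = "ANY"
        elif part_type in PART_TYPES:
            unified["partType"] = part_type
        else:
            raise ValueError(f"unknown part type: {part_type!r}")
    return unified
```

A consumer that only understands the original CAUSE relation could then select the records whose `extension` is "original" and ignore the rest.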

5. Examples 

In the following we will present two examples to illustrate the XML format of the data. The first example 
presents a verbal synset, the second one a link between a verb and its hyperonym. 


<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<!DOCTYPE synsets SYSTEM "germanet_objects.dtd"[]> 
<synset id="vKoerperfunktion.262" wordClass="verben"> 
<lexUnit Eigenname="nein" artificial="nein" id="vKoerperfunktion.262.pennen" orthVar="nein" 
sense="0" stilMarkierung="ja"><orthForm>pennen</orthForm></lexUnit> 
<lexUnit Eigenname="nein" artificial="nein" id="vKoerperfunktion.262.knacken" orthVar="nein" 
sense="0" stilMarkierung="ja"><orthForm>knacken</orthForm></lexUnit> 
<lexUnit Eigenname="nein" artificial="nein" id="vKoerperfunktion.262.ratzen" orthVar="nein" 
sense="0" stilMarkierung="ja"><orthForm>ratzen</orthForm></lexUnit> 
<lexUnit Eigenname="nein" artificial="nein" id="vKoerperfunktion.262.pofen" orthVar="nein" 
sense="0" stilMarkierung="ja"><orthForm>pofen</orthForm></lexUnit> 
<frames><frame>NN</frame><frame>NN.PP</frame><frame>NN.BM</frame></frames> 
<examples><example><text>Er pennt auf der Couch.</text><frame>NN.PP</frame></example> 
<example><text>Sie knackt schon.</text><frame>NN</frame></example> 
<example><text>Er ratzt wie ein Stein.</text><frame>NN.BM</frame></example></examples> 
</synset> 







<?xml version="1.0" encoding="iso-8859-1"?> 
<!DOCTYPE relations SYSTEM "germanet_relations.dtd"> 
<relations> 
<hyperonym sense="0" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended"> 
<locator xlink:type="locator" xlink:href="verben.Koerperfunktion.xml#vKoerperfunktion.259" 
xlink:label="vKoerperfunktion.259"/> 
<locator xlink:type="locator" xlink:href="verben.Koerperfunktion.xml#vKoerperfunktion.262" 
xlink:label="vKoerperfunktion.262"/> 
<arc xlink:type="arc" xlink:from="vKoerperfunktion.262" xlink:to="vKoerperfunktion.259" 
xlink:actuate="onRequest" xlink:show="other"/> schlafen 
</hyperonym> 
</relations> 


Conclusions 

We have developed an XML-conformant representation of GermaNet, aiming at accessibility of our 
large-scale semantic database on the web. This conversion, which has been described in this paper, 
accounts for the various peculiarities of the German wordnet while proposing a common effort of 
wordnet builders to standardise XML-versions of language-specific resources. 

Compatibility issues have been explored in comparison to the Princeton WordNet, which constitutes 
the pioneer resource with the broadest coverage of concepts and its inherent interlingual function. A 
uniform representation could be very fruitful in view of various cross-lingual language processing 
tasks as well as polylingual research issues. 

Acknowledgements 

Our thanks go to Iris Vogel and Holger Wunsch, our students, who performed a good deal of the 
programming work, to Karin Naumann and Andreas Wagner for their help with structural issues and 
to Erhard Hinrichs who has always been an active supporter and a competent advisor. 

References 

Cruse, A. (1986) Lexical Semantics. Cambridge: Cambridge University Press. 

Deutscher Wortschatz. Ein Wegweiser zum treffenden Ausdruck. Eds.: Wehrle, H. and Eggers, H. 
Stuttgart: Ernst Klett Verlag, 1961. 

Duden 8: Sinn- und sachverwandte Wörter. Eds.: Drosdowski, G. and Müller, W. and 
Scholze-Stubenrecht, W. and Wermke, M. Mannheim et al.: Dudenverlag, 2nd ed. 1986. 

Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press. 

Hamp, B. and Feldweg, H. (1997) GermaNet - a Lexical-Semantic Net for German. In: Proceedings of 
the ACL/EACL-97 workshop on Automatic Information Extraction and Building of Lexical Semantic 
Resources for NLP applications. Madrid, July 7-12, 1997. 

Kunze, C. (2001) Lexikalisch-semantische Wortnetze. In R. Klabunde et al. (eds.): Computerlinguistik 
und Sprachtechnologie - Eine Einführung. Spektrum Verlag Heidelberg, pp. 386-393. 

Manning, Christopher D. / Jansz, Kevin / Indurkhya (2001) Kirrkirr - Software for Browsing and 
Visual Exploration of a Structured Warlpiri Dictionary. In: Literary and Linguistic Computing 16, 2, 
pp. 135-151. 

Miller, G. et al. (1990) Five papers on WordNet. CSL-Report, Vol. 43. Cognitive Science 
Laboratory, Princeton University. 

Vossen, P., ed. (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, 
Kluwer Academic Publishers, Dordrecht. 

Vossen, P. (1999) EuroWordNet. Building a Multilingual Database with Lexical-Semantic Networks 
for the European Languages. In: Proceedings of EUROLAN'99, 4th European Summer School on 
Human Language Technology. Iasi, Romania. July 19-31, 1999. 






Wagner, A. and Kunze, C. (1999) Integrating GermaNet into EuroWordNet, a Multilingual Lexical- 
Semantic Database. In: Sprache und Datenverarbeitung, SDv Vol. 23.2/1999, pp. 5-20. 

XLink 1.0: URL=http://www.w3.org/TR/2001/REC-xlink-20010627/ 

XML 1.0: URL=http://www.w3.org/TR/1998/REC-xml-19980210/ 

Claudia Kunze, Abteilung Computerlinguistik, Seminar f. Sprachwissenschaft, Univ. Tübingen, Wilhelmstr. 113, 
D-72074 Tübingen, Germany, kunze@sfs.uni-tuebingen.de 

Lothar Lemnitzer, Abteilung Computerlinguistik, Seminar f. Sprachwissenschaft, Univ. Tübingen, Wilhelmstr. 
113, D-72074 Tübingen, Germany, lothar@sfs.uni-tuebingen.de 





Visualizing WordNet Structure 


Jaap Kamps 


Abstract 

Representations in WordNet are not on the level of individual words or word forms, but on 
the level of word meanings (lexemes). A word meaning, in turn, is characterized by simply 
listing the word forms that can be used to express it in a synonym set (synset). As a result, 
the meaning of a word in WordNet is determined by its sets of synonyms. This is essentially a 
recursive definition of word meaning. Hence meaning in WordNet is a structural notion: the 
meaning of a concept is determined by its position relative to the other words in the larger 
WordNet structure. We have implemented a set of scripts that visualize the WordNet structure 
from the vantage point of a particular word in the database. 


1 Introduction 

This paper reports on visualization tools for Princeton’s WordNet lexical database (Miller, 1990; 
Fellbaum, 1998). One of WordNet’s greatest assets is its wide coverage of the English language. The 
down-side of this coverage is that it is highly non-trivial to get a good overview of particular parts of 
the lexical database. We want to visualize the WordNet structure from the vantage point of a 
particular word in the database. We will focus here on WordNet’s main relation, the synonymy or SYNSET 
relation. The notion of meaning used in WordNet is lexical meaning, and the SYNSET relation denotes 
coincidence of lexical meaning. So our goal is to visualize parts of the WordNet SYNSET structure. 

2 Relatedness and Minimal Path-Length 

The first problem we face is that simply plotting all SYNSET relations immediately results in a knotted 
graph that fails to provide insight into the underlying WordNet structure. That is, we need to find a way 
that abstracts from the synonymy relation while still preserving the WordNet structure. For this reason, 
we investigate distance measures. 

We will define the notion of n-relatedness based on the SYNSET relation (this is similar to the 
graph-theoretic notion of connectedness). 

Definition 1 Two words $w_0$ and $w_n$ are $n$-related if there exists an $(n+1)$-long sequence of words 
$(w_0, w_1, \ldots, w_n)$ such that for each $i$ from $0$ to $n-1$ the two words $w_i$ and $w_{i+1}$ are in the same 
SYNSET. 

For example, the verbs ‘be’ and ‘endure’ are 2-related since there exists a 3-long sequence 
(be, live, endure). Two words may of course be related by many different sequences, or by none at 
all. We will only be interested in the shortest sequences relating words. 

Definition 2 Let MPL be a partial function such that $\mathrm{MPL}(w_i, w_j) = n$ if $n$ is the smallest number 
such that $w_i$ and $w_j$ are $n$-related. 

If there is no sequence relating the two words, then the minimal path-length is undefined. 

The minimal path-length enjoys some of the geometrical properties we might expect from a distance 
measure. 

Observation 1 The minimal path-length is a metric, that is, it gives a non-negative number 
$\mathrm{MPL}(w_i, w_j)$ such that 

i) $\mathrm{MPL}(w_i, w_j) = 0$ if and only if $w_i = w_j$, 

ii) $\mathrm{MPL}(w_i, w_j) = \mathrm{MPL}(w_j, w_i)$, and 

iii) $\mathrm{MPL}(w_i, w_j) + \mathrm{MPL}(w_j, w_k) \geq \mathrm{MPL}(w_i, w_k)$. 

Figure 2: The WordNet database from the vista point of verb ‘be’ and maximal MPL of 2. 

The minimal path-length is a straightforward generalization of the synonymy relation. For example, 
using WordNet we now find that 

1. MPL(be,live) = 1, 

2. MPL(be, endure) = 2, 

3. MPL(be, suffer) = 3, and 

4. MPL(be, lose) = 4. 

Our strategy will be to start with a particular word, and draw the graph of words up to a certain MPL. 
This makes sense considering that the SYNSET relation in WordNet represents similarity of meaning, 
and our MPL is a straightforward generalization of the SYNSET relation. So the resulting graph still 
preserves the crucial WordNet structure. 

3 Implementation 

Our implementation consists of three major ingredients: 

1. For a given word we can derive SYNSET related words from Princeton WordNet 1.7. 

2. We use Perl and Dan Brian’s Lingua::Wordnet module for efficiently deriving sets of words up to 
a given MPL. 

3. We generate output in the appropriate form for the standard Java Graph.java class. This Java 
class for the lay-out of graphs will take care of the visualization proper. 

The only new part is a Perl script that can efficiently generate related words by their MPL. The script 
starts with a particular word (such as ‘be’) and recursively generates all synonyms while filtering away 
words it has encountered earlier. That is, we start with a particular word $w$ (i.e., having minimal 
path-length zero to itself), then generate all words $w_i$ with $\mathrm{MPL}(w, w_i) = 1$, then with $\mathrm{MPL}(w, w_i) = 2$, 
etcetera, until the search is exhausted, or until we reach a given maximal value of MPL. For every new 
word, we also keep track of the word whose synonym it is. When finished, we will simply add a node 
for each word, and draw an edge to the word whose synonym it is. That is, for each related word, we 
only add one edge corresponding to (one of) its minimal paths to the initial word. In this way, the graph 
visualizes a small subset of the SYNSET relation, precisely those that give rise to minimal paths. To 
make the graph more appealing, we can influence the length of this edge by giving it a weight, which 
we let decrease as a function of the MPL. This list of nodes and edges is fed to the Graph Layout Java 
script, which will take care of the actual visualization. 
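The search described above is a breadth-first traversal of the synonymy graph. The following is a minimal sketch in Python rather than the Perl script itself; the toy graph, the function names and the weight 1/MPL are assumptions made for illustration and are not taken from WordNet or from the original implementation.

```python
from collections import deque


def mpl_neighbourhood(synonyms, start, max_mpl):
    """Breadth-first search over a synonymy graph.

    Returns {word: (mpl, parent)} for every word within max_mpl of start;
    parent is the word through which the word was first reached.
    """
    found = {start: (0, None)}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        mpl, _ = found[word]
        if mpl == max_mpl:
            continue  # do not expand beyond the maximal MPL
        for neighbour in sorted(synonyms.get(word, ())):
            if neighbour not in found:  # filter away words seen earlier
                found[neighbour] = (mpl + 1, word)
                queue.append(neighbour)
    return found


def graph_edges(found):
    """One edge per related word, with a weight that decreases with the MPL."""
    return [(parent, word, 1.0 / mpl)
            for word, (mpl, parent) in found.items() if parent is not None]


# Toy, hand-made synonymy graph (hypothetical data, NOT real WordNet content).
TOY_SYNONYMS = {
    "be": {"live", "exist"},
    "live": {"be", "exist", "endure"},
    "exist": {"be", "live"},
    "endure": {"live", "suffer"},
    "suffer": {"endure", "lose"},
    "lose": {"suffer"},
}

result = mpl_neighbourhood(TOY_SYNONYMS, "be", 2)
edges = graph_edges(result)
```

Because every word is recorded only the first time it is reached, each word receives exactly one edge, namely the last step of one of its minimal paths to the initial word.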

Consider that we want to know the WordNet structure in the neighborhood of the verb ‘be.’ Figures 1 
and 2 show screendumps of the graphs for MPL ≤ 1 and ≤ 2, respectively. The Java script is dynamic in
the sense that the nodes can be manipulated. Although the graphs based on MPL are much sparser than 
the full SYNSET relation, the graphs get crowded when we increase the maximal MPL. This is simply
due to the rapid increase in the number of words; see for example figure 3. For this reason, the script has an
additional argument that allows us to ignore words with a low word familiarity or polysemy count. See 
figure 4 for the same part of the WordNet lexical database, while filtering away words with polysemy 
count < 4. 





4 Conclusions and Discussion 

One of the original design principles of WordNet is the use of a differential theory of lexical semantics 
(Miller, 1990). Representations in WordNet are not on the level of individual words or word forms, but 
on the level of word meanings (lexemes). A word meaning, in turn, is characterized by simply listing 
the word forms that can be used to express it in a synonym set (synset). As a result, the meaning of a word
in WordNet is determined by its sets of synonyms. This is essentially a recursive definition of word
meaning. Hence meaning in WordNet is a structural notion: the meaning of a concept is determined by 
its position relative to the other words in the larger WordNet structure. 

In this paper, we discussed the visualization of WordNet structure from the vantage point of a particular
word. That is, we want to position ourselves on a particular word, and overview the larger structure
of WordNet from there. This approach is reminiscent of the perspective of modal operators in logic
(Blackburn et al., 2001). This way of visualizing local parts of the WordNet database has proven its use for
testing and evaluating WordNet similarity measures (Kamps and Marx, 2001).

Since WordNet’s main SYNSET relation is too rich to allow for direct visualization, we focused 
on its straightforward generalization, the minimal path-length—a distance metric. Such measures of 
distance, similarity, or relatedness are well-known in natural language processing. The use of path-length
as a similarity metric is also discussed in (Rada et al., 1989).

The basic notion of meaning used in WordNet is lexical meaning, and WordNet’s main SYNSET
relation denotes coincidence of lexical meaning. Interestingly, WordNet is partly inspired by
psycholinguistic theories of human lexical memory. That is, the meaning of a word is also determined by its
place in the larger structure of the database. Also note that this larger structure shows some resemblance
with our own lexical memory. This may explain some of the intuitive appeal of the generated graphs. 

Acknowledgments This research was supported by the Netherlands Organization for Scientific
Research (NWO, grant # 400-20-036). Thanks to Patrick Blackburn, Maarten Marx, Michael Masuch,
Rob Mokken and Ivar Vermeulen for their comments. All data is derived from Princeton WordNet 1.7,
using Perl and Dan Brian’s excellent Lingua::Wordnet module, and the Graph.java class.

On-line examples of the WordNet visualization scripts are available at the following URL: 
http://www.illc.uva.nl/~kamps/wordnet/.

Institute for Logic, Language and Computation 
University of Amsterdam 
Nieuwe Achtergracht 166
1018 WV Amsterdam
the Netherlands

kamps@illc.uva.nl

References 

Patrick Blackburn, Maarten de Rijke, and Yde Venema. 2001. Modal Logic, volume 53 of Cambridge Tracts in
Theoretical Computer Science. Cambridge University Press, Cambridge UK.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and
Communication Series. The MIT Press, Cambridge MA.

Jaap Kamps and Maarten Marx. 2001. Words with attitude. Technical Report PP-2001-16, Institute for Logic,
Language and Computation, University of Amsterdam.

George A. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography,
3(4):235-312. Special Issue.

Roy Rada, Hafedh Mili, Ellen Bicknell, and Maria Blettner. 1989. Development and application of a metric on
semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19:17-30.





Structured Access to Scientific Information 


Caterina Caracciolo 
Maarten de Rijke 


Abstract 

We report on an ongoing project aimed at providing an exemplary architecture for an electronic
dissemination environment for scientific handbooks. We focus on our way of facilitating
navigation through and access to electronic handbooks by using a WordNet-like concept
hierarchy consisting of synsets that are connected to each other and to external sources by 
semantic relations for navigational purposes. 


1 Introduction 

Electronic publishing offers many opportunities, for readers, authors, and publishers alike. While technical
reports, conference proceedings, and journals are increasingly being made available in an electronic
form, sometimes even exclusively, other kinds of scientific publications have, by and large, not
been recast for electronic dissemination yet. In particular, for scientific knowledge of a unifying kind, 
as traditionally found in a handbook, several proposals for a suitable architecture are only now being 
tried out or explored. 

There are many reasons that justify electronic versions of scientific handbooks such as the Handbook
of Logic and Language (van Benthem and ter Meulen, 1997) or the Handbook of Automated Reasoning
(Robinson and Voronkov, 2001). Electronic publication makes distribution easier and quicker, and readers
can be helped considerably when searching for information: even simple keyword searches are more useful
than scanning tables of contents or indexes, especially for large handbooks, and tracking down a reference
can be as simple as a mouse click. Also, electronic publications are less rigid than their paper counterparts:
the publication can be adaptable to the reader, thus better satisfying her information needs. Electronic
availability also facilitates integration with other media types: e.g., computer simulations and visualizations,
movies, and tools.

Electronic books facilitate a more modular way of reading than traditional paper books. Indeed, many 
web sites consist of many relatively small modules connected by hyperlinks. (Harmsze, 2000) proposes 
a modular structure for articles in experimental sciences, but it is not clear whether this approach can be 
adapted to handbooks that contain more abstract content. A potential problem with the ‘small modules 
- many links’ structuring of information is disorientation of the reader (Conklin, 1987). A reader should 
know where she is in the hypertext network, and how to get to other locations: high quality navigation 
tools are essential. 

In our approach to the development of an electronic dissemination environment for scientific handbooks,
we intend to facilitate navigation through and access to electronic handbooks by using a
WordNet-like concept hierarchy. It consists of synsets that are connected to each other and to external 
sources by semantic relations. 

2 Strategy 

Topic or concept hierarchies are often used for the purpose of navigating through large collections 
of documents. They are very useful for the organization, display and exploration of large amounts 
of information. Well-known examples include Yahoo!’s topic hierarchy for exploring the Web, and 
Google’s directories (based on the DMOZ open directories initiative). 



One of the advantages of using concept hierarchies is that users do not have to know exactly what 
information they are looking for. Without having to phrase their information need in precise terms, they
can browse from general to more specific categories, or from example to counterexample, and thus get a 
clearer idea of the information being sought. This is especially helpful for readers that are less familiar 
with the topic for which they consult the handbook. 

It has been shown that users in a hypertext search task who had hierarchical browsing patterns 
through the hypertext performed better than users who had sequential browsing paths (McEneaney, 
1999). Therefore, it is very important that architectures for electronic handbooks allow, or even enforce,
such hierarchical patterns, and a concept hierarchy is a good way of doing this. 

Based on these considerations, it was decided to investigate the use of fine-grained concept hierarchies
for navigation through and access to scientific handbooks within the Logic & Language Links
project. To make matters concrete, the project aims to develop an electronic version of the Handbook 
of Logic and Language (van Benthem and ter Meulen, 1997). However, the envisaged results and our 
discussion below are applicable to many other domains. 

3 Organization of the Hierarchy 

The building blocks of the concept hierarchy are the concepts, and its cement consists of several semantic
relationships. In line with WordNet (Fellbaum, 1998), we make a distinction between words
or terms on the one hand, and concepts on the other hand: a concept is denoted by a synset, a set of
synonymous words. Words are synonymous if they have (more or less) the same meaning in some
setting. The semantic relationships come in two kinds: ones internal to the concept hierarchy, and ones that link
the concepts to external resources.

3.1 Internal Architecture 

Concepts in the hierarchy are annotated with a gloss: for instance, ‘the study of language meaning’ is a
gloss for semantics. Moreover, they come with multiple, increasingly more technical descriptions, only
one of which will be served to an individual reader, depending on her level of expertise. 

The main semantic relationship that structures our concept hierarchy is related subtopic, or simply 
subtopic. While it covers the usual ‘is-a-kind-of’ relation (e.g., epistemic-logic is a related subtopic of
modal-logic in this sense), the subtopic relation extends it in a number of ways. For instance, we allow
the subtopic relation to cover meronymic cases as well: the relationship between compactness and logic 
(in the sense of ‘a system of reasoning’) is of this kind. 

We don’t require that the concept hierarchy be a strict tree. Except the root, every node may have 
multiple parents. We do not allow cycles in the subtopic hierarchy, since cycles disorientate readers. 
Moreover, we don’t allow any concept to be unconnected to the rest of the hierarchy. At present, every 
concept is below one of four ‘beginners’ ( computer science, mathematics, linguistics, philosophy). 
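The structural constraints stated above (multiple parents allowed, no cycles, every concept below a beginner) can be checked mechanically. The sketch below is one possible way to do so, not the project's actual tooling; the parent assignments in the sample data are hypothetical, and only the concept names and the four beginners come from the text.

```python
def find_cycle(parents):
    """Return one cycle in the subtopic relation as a list, or None."""
    on_path, finished = [], set()

    def visit(concept):
        on_path.append(concept)
        for parent in sorted(parents.get(concept, ())):
            if parent in on_path:  # we came back to a concept on the current path
                return on_path[on_path.index(parent):] + [parent]
            if parent not in finished:
                cycle = visit(parent)
                if cycle:
                    return cycle
        on_path.pop()
        finished.add(concept)
        return None

    for concept in sorted(parents):
        if concept not in finished:
            cycle = visit(concept)
            if cycle:
                return cycle
    return None


def unanchored(parents, beginners):
    """Concepts that are not below any of the beginner concepts."""
    missing = []
    for concept in sorted(parents):
        seen, stack = set(), [concept]
        while stack:
            node = stack.pop()
            if node in beginners:
                break  # found a beginner among the ancestors
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        else:
            missing.append(concept)  # ancestors exhausted without a beginner
    return missing


BEGINNERS = {"computer science", "mathematics", "linguistics", "philosophy"}

# Hypothetical fragment: each concept maps to the set of its parents.
HIERARCHY = {
    "logic": {"mathematics", "philosophy"},  # a node with multiple parents
    "modal-logic": {"logic"},
    "epistemic-logic": {"modal-logic"},
    "compactness": {"logic"},
}

cycle = find_cycle(HIERARCHY)
orphans = unanchored(HIERARCHY, BEGINNERS)
```

Running both checks after every edit of the hierarchy would flag a cycle or a disconnected concept before it reaches readers.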

As to additional (non-hierarchical) navigational relations, these include the following: 

Similarity: concepts are similar if they share some properties or are somehow analogous to each 
other. For instance, finite state machine is similar in this sense to regular language. 

Antonymy: learning the antonym of a concept not only teaches us more about the meaning of the 
antonym, but also about the concept itself (Muehleisen, 1997). 

Sibling: informal experiments have convinced us that readers find it useful to know what the siblings 
of a given concept are; it helps prevent the ‘lost in space’ problem. 

Other meanings: for every name in the synset of a given concept, we provide links to other concepts 
in whose synset the name string occurs. 
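Two of these relations, sibling and other meanings, need not be stored because they follow from the hierarchy and the synsets. A small sketch under assumed data structures (a parent map and a synset map); the concept identifiers and the sample data are hypothetical.

```python
def siblings(parents, concept):
    """Concepts sharing at least one parent with the given concept."""
    own = parents.get(concept, set())
    return sorted(c for c, ps in parents.items() if c != concept and ps & own)


def other_meanings(synsets, concept):
    """For each name in the concept's synset, the other concepts using that name."""
    links = {}
    for name in sorted(synsets[concept]):
        others = sorted(c for c, names in synsets.items()
                        if c != concept and name in names)
        if others:
            links[name] = others
    return links


# Hypothetical data: concept -> set of parents, and concept -> synset.
PARENTS = {
    "modal-logic": {"logic"},
    "compactness": {"logic"},
    "epistemic-logic": {"modal-logic"},
}
SYNSETS = {
    "logic/field-of-study": {"logic", "formal logic"},
    "logic/system-of-reasoning": {"logic"},
}

sibling_links = siblings(PARENTS, "modal-logic")
meaning_links = other_meanings(SYNSETS, "logic/field-of-study")
```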

3.2 External Connections 

In addition to the internal links, our concept hierarchy has external links in the sense that they are
between concepts and targets outside the hierarchy. We distinguish between handbook links (to information
in the handbook but outside the concept hierarchy), and web links (to information sources on the
web). Again, we make this distinction for the benefit of the reader; we have found that it is important
for a reader to know whether a link target is outside the space controlled by us.

The target of a handbook link can be of different levels of granularity (a part, a chapter, a subsection, 
a definition, etc.). Ideally, concepts higher in the hierarchy refer to larger fragments in the handbook, 
while lower concepts refer to smaller parts. However, as the handbook chapters are written by different 
authors, resulting in a different structuring and writing style for every chapter, this is hard to achieve. 

Internal and external links in the concept hierarchy take advantage of a set of metadata that is automatically
generated. Internal links are established using the references given by the unique identifier
associated with each node.

Handbook links come with metadata describing crucial information about the publication linked 
(e.g., author, editor, publisher), enriched with an indication of the link type (e.g., definition, theorem, 
discussed-in, example/counterexample). As to web links, they too come in a small number of types, 
including research group, home pages, tools (links to software related to the concept; for example, 
information-retrieval has a link to the mg system), and publications. 

We mostly adhere to the Dublin Core, even though in our approach a large part of the metadata 
suggested by the Dublin Core plays the role of actual data. Even data about authors and editors of the 
concept hierarchy is meant to be available as user data. Data about creation/modification of concepts in
the concept hierarchy has a somewhat different role. At present it is metadata, but in a future scenario it
will be user data as well. 

4 Building the Hierarchy 

The Logic & Language concept hierarchy is currently being built, by hand. Our efforts are based on 
The Bluffer’s Guide to Computational Semantics (Fracas, 1996), and the glossary of the Handbook of 
Logic and Language (Groeneveld, 1997). 

At present the focus of our work on constructing the hierarchy is on creating the hierarchical relationships,
complete with glosses and internal relations; extensive descriptions and external links have
mostly been left out at this stage. Domain experts at the authors’ home institute are about to be involved
in the process of building the hierarchy as a large-scale community-building effort.

4.1 The Current Prototype 

The current version of our hierarchy is populated with close to 1000 concepts, provided by us. For 
every concept we maintain a single XML file. From these XML files a Logic & Language web site is 
generated at regular intervals, to incorporate changes to the underlying XML files. 

We are currently setting up a web-based interface to enable domain experts to easily add further 
concepts or modify existing ones. Editors are being approached to take responsibility for subparts of 
the hierarchy (such as computational-logic or dynamic-semantics ). We plan to launch a version of the 
Logic & Language concept hierarchy on a publicly accessible web server in early 2002. 

4.2 Support Tools: Hierarchy Development 

In our approach it is essential that the concept hierarchy be constructed by a team of human editors 
to guarantee high quality. Nevertheless, this activity is time consuming and error prone. For this
reason, in constructing the hierarchy and in linking it, we intend to provide authors with tools that
make suggestions and check for coherence of the inserted data. 

Completeness and correctness are two important criteria in our hierarchy development efforts: the 
concepts in the hierarchy should cover the information in the domain covered by the handbook, and 
they should only cover information in that domain. We have carried out two kinds of experiments 
to help us ensure these criteria. First, we have used ideas based on inverse document frequencies of 
terms in collections of arbitrary scientific papers versus collections of papers in the logic and language area.
Second, we have explored methods for automatically generating concept hierarchies. Research on the
latter comes in three flavors: based on pattern matching, based on partial parsing, and based on statistics and
co-occurrence. Well-known work on using pattern-matching methods for extending WordNet
is described by Hearst (1998), while Manning (1993) is an example of work based on partial
parsing. Sanderson and Croft (1999) aim to generate a hierarchy with the same subtopic relation we
employ in our concept hierarchy, a mixture of hypernymic and meronymic relationships. In response
to a suitable query, they consider the set of 500 top-ranked documents. From these a term collection is
built based on similarity to the query, which is then used to build the hierarchy.
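The first kind of experiment, contrasting inverse document frequencies across two collections, might look roughly as follows. The text does not give the scoring that was actually used; the smoothed IDF, the difference score and the tiny collections below are all assumptions chosen for illustration.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency of a term in a list of token sets."""
    df = sum(1 for doc in documents if term in doc)
    return math.log((1 + len(documents)) / (1 + df))


def domain_specificity(term, general_docs, domain_docs):
    """Positive when the term is rarer in general papers than in domain papers."""
    return idf(term, general_docs) - idf(term, domain_docs)


# Hypothetical collections; each document is reduced to its set of tokens.
GENERAL = [
    {"model", "data", "results"},
    {"protein", "data"},
    {"model", "experiment"},
    {"semantics", "data"},
]
DOMAIN = [
    {"semantics", "model", "logic"},
    {"semantics", "quantifier"},
    {"logic", "model", "proof"},
]

candidates = ["data", "model", "semantics"]
ranked = sorted(candidates,
                key=lambda t: domain_specificity(t, GENERAL, DOMAIN),
                reverse=True)
```

Terms with a high score are candidates for inclusion in the hierarchy (completeness), while hierarchy terms with a negative score may fall outside the domain (correctness).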

We have carried out small-scale experiments with the Sanderson and Croft algorithm. To be applied 
to the construction of a concept hierarchy for the Logic & Language area, some adaptations had to be 
made. First, we needed a sufficiently large corpus of papers in the area. Second, we do not have a query 
whose terms can be used as potential concepts in the hierarchy; we simply use hierarchy terms that are 
already present. 

Having automatically generated a hierarchy, we may be faced with the need to merge (parts of) it
with the existing hand-built one. Recent work on ontology development and enlargement, such as
Chimaera (McGuinness et al., 2000) and Prompt (Noy and Musen, 2000), is particularly relevant to us
for this purpose. 

4.3 Support Tools: Linking the Hierarchy

Finally, let’s turn to the task of generating links from the concept hierarchy, which is another natural 
task begging for automation. So far, we have experimented with automatically generating hypertext 
links from concepts in the hierarchy to (electronic versions of) the chapters in the handbook. We used
the vector space model, exploring a variety of options. As the documents to be retrieved, we have 
taken pages of the original handbook; while arbitrary, this choice was forced upon us by the diversity 
of the writing styles of the contributing authors. Some preliminary experiments indicated that cosine 
similarity provides the best weighting scheme for this setting, with normalization for the queries, but 
not for the documents. For the queries we explored several possibilities (term, term plus description, 
term and description plus additional weightings on the term). We have found that the best retrieval results
were obtained by factoring the key terms (taken from the concept hierarchy) with a higher constant than 
the descriptions of the terms (Monz et al., 2000). 
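The retrieval setting just described can be sketched as follows: the query is built from a concept's key term and its description, the key term is multiplied by a higher constant, and only the query vector is length-normalized. The boost value of 3.0, the whitespace tokenization and the page texts are assumptions, not the settings used in the experiments.

```python
import math
from collections import Counter


def query_vector(term, description, term_boost=3.0):
    """Weight tokens of the key term higher than tokens of its description."""
    vec = Counter()
    for tok in description.lower().split():
        vec[tok] += 1.0
    for tok in term.lower().split():
        vec[tok] += term_boost
    return vec


def score(query, page_text):
    """Dot product with the query normalized to unit length (pages are not)."""
    page = Counter(page_text.lower().split())
    norm = math.sqrt(sum(w * w for w in query.values()))
    return sum(w * page[tok] for tok, w in query.items()) / norm


def rank_pages(term, description, pages, term_boost=3.0):
    """Page identifiers ordered from best to worst match."""
    q = query_vector(term, description, term_boost)
    return sorted(pages, key=lambda name: score(q, pages[name]), reverse=True)


# Hypothetical handbook pages.
PAGES = {
    "p12": "modal logic studies necessity and possibility",
    "p40": "the study of language meaning in natural language",
    "p77": "semantics of modal operators and semantics of tense",
}

boosted = rank_pages("semantics", "the study of language meaning", PAGES)
unboosted = rank_pages("semantics", "the study of language meaning", PAGES,
                       term_boost=1.0)
```

In this toy example the boost decides whether the page that actually uses the key term outranks a page that merely shares words with the description.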


5 Conclusion 

We have reported on ongoing work aimed at providing an exemplary architecture for an electronic 
dissemination environment for scientific handbooks. We focused on facilitating navigation through and 
access to electronic handbooks by means of a WordNet-like concept hierarchy consisting of synsets 
connected to each other and to external sources by various semantic relations. 


Acknowledgments This research was supported by Elsevier Science Publishers. The second author
was supported by the Spinoza project ‘Logic in Action’ and by grants from the Netherlands Organization
for Scientific Research (NWO), under project numbers 612-13-001, 365-20-005, 612.069.006,
612.000.106, and 220-80-001. We thank Joost Kircz and the referees for helpful comments and suggestions.


Institute for Logic, Language and Computation
University of Amsterdam
Nieuwe Achtergracht 166
1018 WV Amsterdam
the Netherlands

{caterina,mdr}@science.uva.nl





References 


J. Conklin. 1987. Hypertext: An introduction and survey. IEEE Computer, pages 17-41.

C. Fellbaum, editor. 1998. WordNet, an Electronic Lexical Database. MIT Press. 

L. R.E. Fracas. 1996. The bluffer’s guide to computational semantics, January. 

W. Groeneveld. 1997. Logic and language: A glossary. In (van Benthem and ter Meulen, 1997), pages 1179- 
1213. 

F. Harmsze. 2000. A Modular Structure for Scientific Articles in an Electronic Environment. Ph.D. thesis, 
Universiteit van Amsterdam. 

M. A. Hearst. 1998. Automated discovery of WordNet relations. In (Fellbaum, 1998), pages 131-151. 

C. D. Manning. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. In Proc. 31st
ACL, pages 235-242.

J.E. McEneaney. 1999. Visualizing and assessing navigation in hypertext. In Proc. 10th ACM Conference on
Hypertext and Hypermedia, pages 61-70.

D. L. McGuinness, R. Fikes, J. Rice, and S. Wilder. 2000. The Chimaera ontology environment. In Proc. AAAI
2000.

C. Monz, J. Ragetli, and M. de Rijke. 2000. Concept-based computer-aided link generation for electronic 
handbooks. In Proc. DIRW’00. 

V.L. Muehleisen. 1997. Antonymy and Semantic Range in English. Ph.D. thesis, Northwestern University. 

N. Fridman Noy and M.A. Musen. 2000. Prompt: Algorithm and tool for automated ontology merging and
alignment. In Proc. of AAAI 2000.

J. Robinson and A. Voronkov, editors. 2001. Handbook of Automated Reasoning. Elsevier. 

M. Sanderson and B. Croft. 1999. Deriving concept hierarchies from text. In Proc. SIGIR’99, pages 206-213.

J. van Benthem and A. ter Meulen, editors. 1997. Handbook of Logic and Language. Elsevier.






VisDic - A New Tool for WordNet Editing

Tomas Pavelek 
Karel Pala 


Abstract 

This contribution describes a new tool (named VisDic) for browsing and editing WordNet databases. 
It was developed in the Natural Language Processing Laboratory at the Faculty of Informatics, 
Masaryk University. It is not designed as a specialized tool for processing WordNet data only;
more generally, it has been developed as a tool for viewing and editing any lexical database, e.g.
multilingual dictionaries, monolingual dictionaries, corpora, etc. From this point of view, WordNet
can also be understood as a dictionary with special features.

1. Polaris vs. VisDic 

VisDic uses an XML data format, which appears to be suitable for WordNet databases. Typically they
consist of entries of a special sort called synsets. A synset can be understood as a structure having
the following shape: [(List of synonyms), (Part of speech), (Gloss), (Semantic Relations)] (see [1]). 
List of synonyms is understood as the list of literals in a specific sense. Semantic relations can be 
divided into Internal Language Relations (ILR) and external relations (see [2]). 

The previous WordNet editing tool Polaris used its own Import-Export data format for representation
of wordnets (see [3]). We would like to show that, in contrast with Polaris, VisDic offers the following
improvements:

1. The relation part of the Polaris format contains the name of the relation and the literal in a specific
sense referred to by the relation. This approach slows down the link evaluation. VisDic assigns a
special label to every synset. The label is called a key. It uniquely represents the synset. Any
relation can thus be understood as a pointer which contains the key of the referred synset.

2. Some ILRs, e.g. the hyperonym and hyponym relations, are reversible in the following sense: if
synset A is a hyperonym of synset B, then synset B is a hyponym of synset A. This means that a
hyponymical relation can be automatically derived from the corresponding hyperonymical
relation. It is then not necessary to store both of them, and hyponyms can be omitted.

3. It is not necessary to keep the same glosses in every WordNet database. They can be stored in one
place. It is reasonable to create another dictionary (list) which will contain the shared data, the ILI
dictionary. Synset glosses can be looked up in this dictionary according to the synset key.
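Points 1 and 2 can be illustrated with a short Python sketch that reads synsets in the XML shape of Figure 1b and derives the hyponyms by inverting the stored hyperonym pointers. This is not VisDic code: the enclosing WN element, the use of the ILI element as the synset key, the assignment of key 00002728-n to "being" and the key 99999999-n for "wildlife" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample data modelled on Figure 1b; the <WN> wrapper and the key of
# "wildlife" are hypothetical.
VISDIC_XML = """<WN>
<SYNSET><POS>n</POS><SYNONYM><LITERAL>being<SENSE>1</SENSE></LITERAL></SYNONYM><ILI>00002728-n</ILI></SYNSET>
<SYNSET><POS>n</POS><SYNONYM><LITERAL>life<SENSE>1</SENSE></LITERAL></SYNONYM><ILI>00003504-n</ILI><HYPERONYM>00002728-n</HYPERONYM></SYNSET>
<SYNSET><POS>n</POS><SYNONYM><LITERAL>wildlife<SENSE>1</SENSE></LITERAL></SYNONYM><ILI>99999999-n</ILI><HYPERONYM>00003504-n</HYPERONYM></SYNSET>
</WN>"""


def load_synsets(xml_text):
    """Map each synset key to its literals and the keys of its hyperonyms."""
    synsets = {}
    for node in ET.fromstring(xml_text).iter("SYNSET"):
        key = node.findtext("ILI").strip()
        literals = [lit.text.strip() for lit in node.iter("LITERAL")]
        hyperonyms = [h.text.strip() for h in node.findall("HYPERONYM")]
        synsets[key] = {"literals": literals, "hyperonyms": hyperonyms}
    return synsets


def derive_hyponyms(synsets):
    """Invert the stored hyperonym pointers to obtain the hyponym relation."""
    hyponyms = defaultdict(list)
    for key, synset in synsets.items():
        for parent in synset["hyperonyms"]:
            hyponyms[parent].append(key)
    return dict(hyponyms)


synsets = load_synsets(VISDIC_XML)
hyponyms = derive_hyponyms(synsets)
```

Since a relation is only a key, following it is a single dictionary lookup instead of a search for a literal in a given sense.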


Figures 1a) and 1b) show the difference between the Polaris and VisDic synset representations
(the latter in XML), taken from [2].


a) 0 @3@ WORD_MEANING
     1 PART_OF_SPEECH "n"
     1 VARIANTS
       2 LITERAL "life"
         3 SENSE 1
         3 DEFINITION "living things collectively; "there is no life on Mars""
         3 EXTERNAL_INFO
           4 SOURCE_ID 1
             5 TEXT_KEY "00003504-n"
     1 INTERNAL_LINKS
       2 RELATION "has_hyperonym"
         3 TARGET_CONCEPT
           4 PART_OF_SPEECH "n"
           4 LITERAL "being"
             5 SENSE 1
       2 RELATION "has_hyponym"
         3 TARGET_CONCEPT
           4 PART_OF_SPEECH "n"
           4 LITERAL "wildlife"
             5 SENSE 1
     1 EQ_LINKS
       2 EQ_RELATION "eq_synonym"
         3 TARGET_ILI
           4 PART_OF_SPEECH "n"
           4 WORDNET_OFFSET 3504

b) <SYNSET>
     <POS>n</POS>
     <SYNONYM>
       <LITERAL>life
         <SENSE>1</SENSE>
       </LITERAL>
     </SYNONYM>
     <ILI>00003504-n</ILI>
     <HYPERONYM>00002728-n</HYPERONYM>
   </SYNSET>


Figure 1: Synset representations in WordNet database 
a) Polaris Import/Export format b) VisDic XML format 



However, it can easily be seen that the XML format can also be used for representing other dictionaries.
At present, VisDic can handle all the EuroWordNet databases developed within the EWN 1, 2
projects, the Dictionary of Literary Czech (SSJČ), the Dictionary of Czech Synonyms (SCS), the Collins Cobuild
dictionary and an example of an English corpus represented in XML.

2. Functions of VisDic

The VisDic application and its functions are described in this part. The performance of the tool is
intimately tied up with the data representation mentioned above. A good definition of the dictionary
structure can lead to fast searching and modification of data, as the WordNet XML structure from the
previous part shows.

VisDic is designed to hold up to ten dictionaries at the same time. The user can work with WordNet
databases, multilingual dictionaries, monolingual dictionaries, corpora or other types of lexical
databases, all in one window. One can even open several copies of one dictionary at the
same time, but only one of them can be changed; the others can only be viewed. This feature
enables the user to look up additional information in the same data source.

Another reason for working with dictionaries simultaneously is the possibility of their cooperation.
When entries share the same value of the entry identifier (the key attribute discussed
above), the respective entry can be shown in the other dictionary. Then, in the case of a monolingual
dictionary, say an English one, it is possible to look up the same entry in another monolingual
dictionary, e.g. a Czech one, although neither of these dictionaries is multilingual. For example, this
feature is very suitable for the WordNet multilingual view.
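The cooperation between dictionaries amounts to looking up one key in several key-indexed tables. A minimal sketch, assuming each loaded dictionary is available as a mapping from key to literals; the function name and the sample entries (including the Czech literal) are illustrative only.

```python
def aligned_entries(key, dictionaries):
    """Return the entry stored under the key in every dictionary that has it."""
    return {name: entries[key]
            for name, entries in dictionaries.items() if key in entries}


# Hypothetical contents of two monolingual wordnets sharing synset keys.
LOADED = {
    "English": {"00003504-n": ["life"], "00002728-n": ["being"]},
    "Czech": {"00003504-n": ["život"]},
}

view = aligned_entries("00003504-n", LOADED)
partial = aligned_entries("00002728-n", LOADED)
```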



Figure 3: VisDic layout


2.1. Main Application Window 

In the application, every dictionary has its own window frame. The frames are arranged from left to 
right. The positions of the frames displaying the active dictionaries can be changed at any time. A window frame
contains an edit box for dictionary querying where the user can specify what exactly he wants to find.
The corresponding entries are listed in a list box below. After the desired entry is selected, it is
shown in the last part of the frame, according to the type of view discussed below. For illustration
see Figure 3. 


























2.2. The Types of Entry View 


There are five basic types of entry view offered by VisDic. 

1. XML view displays the raw XML format in which an entry is represented. 

2. Text view shows the information contained in the individual part of entry, such as its head, 
senses, definitions, etc., in the user-specified format. It is possible to specify how the items will be 
displayed - in which font and which color - all in the dictionary configuration file (CFG). 

3. Tree view shows the current entry as it is linked to the other entries. This information is
displayed in a tree structure. In the CFG file one can define which attribute will be understood as
a parent and which attributes will indicate the children. Especially in WordNet databases, the
typical tree structures can be formed by the hyperonym and hyponym ILR (see footnote 36), or by some kind of
the meronym and holonym ILR. Every tree node can be expanded or collapsed. During
expansion, the number of children is counted and displayed. Furthermore, a node can also be fully
expanded, which means that every node belonging to the subtree rooted at the given node will be
expanded. This function is helpful for counting all the entries of the subtree.

4. Editing view enables the user to change information stored in the entry attributes by means of special edit
boxes. The structure of the editing view can be defined in the CFG file.

5. Word view is not specific for a current entry. In it one can view all the words belonging to a 
special attribute. This attribute can be also specified in CFG. The list is alphabetically sorted and 
can come in handy in the case when the user wants to browse the dictionary systematically. It is 
possible to drag a specified word to the query edit box and look up the corresponding entries. 
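The full expansion offered by the tree view (item 3) corresponds to collecting every entry reachable through the child relation. A minimal sketch, assuming the children of each entry are available as a mapping; the sample entries are hypothetical. Entries reachable along several paths are counted once, which matters because a synset may have more than one hyperonym.

```python
def expand_fully(children, root):
    """Return all entries in the subtree below root (root itself excluded)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:  # an entry with two parents is counted once
                seen.add(child)
                stack.append(child)
    return seen


# Hypothetical fragment; "surface" has two parents.
CHILDREN = {
    "object": ["artefact", "surface"],
    "artefact": ["surface", "tool"],
    "surface": ["front", "back"],
}

subtree = expand_fully(CHILDREN, "object")
subtree_size = len(subtree)
```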

3. Special WordNet Features 

The VisDic tool offers specific functions as well. They help to maintain the WordNet databases.
Although all features presented here can also be applied to other dictionaries, they were primarily
intended for manipulation within WordNet databases. That is the reason why they are mentioned
separately.

3.1. Finding Topmost Entries 

A special function of the tree view is to find all the entries which do not have any parent. In the
WordNet hypero-hyponymical tree (H/H tree) it can find all synsets which do not
have any hyperonyms. In an ideal case these topmost synsets are exactly the ones corresponding to the
top ontology synsets. However, during WordNet editing some relations can be broken. In particular,
deleting the hyponym relations belonging to a node can leave synsets free, i.e. without any parent, and
thus they also become topmost ones. This function can be useful for detecting this kind of
inconsistency.
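Finding the topmost entries is a simple filter over the hyperonym pointers. The sketch below assumes each synset key is mapped to the list of its hyperonym keys; the sample data is hypothetical, with one synset that has lost its link during editing.

```python
def topmost(hyperonyms):
    """Keys of synsets without any hyperonym, i.e. without a parent in the H/H tree."""
    return sorted(key for key, parents in hyperonyms.items() if not parents)


def topmost_ratio(hyperonyms):
    """Percentage of synsets that are topmost."""
    return 100.0 * len(topmost(hyperonyms)) / len(hyperonyms)


# Hypothetical fragment: "wildlife" has lost its hyperonym during editing.
HYPERONYMS = {
    "entity": [],
    "being": ["entity"],
    "life": ["being"],
    "wildlife": [],
}

roots = topmost(HYPERONYMS)
ratio = topmost_ratio(HYPERONYMS)
```

Comparing the result with the expected top ontology synsets shows which entries have become topmost by accident.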




              All parts of speech          Nouns
WordNet       Topmost    All    Ratio      Topmost    All    Ratio
English 1.5     23215  94515   24,7 %           11  60521   ~0,0 %
Czech             277  12824    2,2 %           52   9727    0,5 %
Dutch             545  44128    1,2 %            2  34460   ~0,0 %
Estonian          232   7678    3,0 %           40   5028    0,7 %
French            240  22745    1,1 %           36  17826    0,2 %
German            808  15132    5,3 %          548   9951    5,5 %
Italian          1513  40410    3,7 %           24  30146    0,1 %
Spanish          2569  30350    8,5 %           13  24135    0,1 %

Table 1: Number of topmost entries in the H/H tree in the EuroWordNet 2 project


36 In fact, these two tags do not form a tree in all cases, because there is no restriction to have only one hyperonym for a 
synset, but for most of the synsets this condition is fulfilled. 





For illustration, Table 1 gives the number of topmost synsets in the H/H tree, the number of all
synsets and their ratio, both for all parts of speech and for nouns alone. The statistics were
prepared for all databases presented in the final report of the EuroWordNet 2 project (see [2]).
The table shows that nouns can typically be organized as trees, while other parts of speech,
particularly adjectives and adverbs, do not form such trees easily. This holds especially for the
English WordNet 1.5, which contains 23215 topmost entries when all parts of speech are considered,
while only 11 of its noun synsets are topmost.
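
The topmost-synset check and the ratio reported in Table 1 can be illustrated with a minimal
sketch. It assumes that the H/H relation is available as a mapping from each synset to the set of
its hyperonyms; the function names and the toy synset identifiers are hypothetical and are not
part of VisDic.

```python
from typing import Dict, List, Set


def topmost_synsets(hyperonyms: Dict[str, Set[str]]) -> List[str]:
    """Return the ids of synsets that have no hyperonym (no parent in the H/H tree)."""
    return sorted(sid for sid, parents in hyperonyms.items() if not parents)


def topmost_ratio(hyperonyms: Dict[str, Set[str]]) -> float:
    """Return the percentage of synsets that are topmost."""
    if not hyperonyms:
        return 0.0
    return 100.0 * len(topmost_synsets(hyperonyms)) / len(hyperonyms)


# Toy data (hypothetical ids): synset id -> set of hyperonym ids.
toy = {
    "entity": set(),
    "animal": {"entity"},
    "dog": {"animal"},
    "orphan": set(),  # e.g. a synset whose parent link was lost during editing
}

print(topmost_synsets(toy))
print(topmost_ratio(toy))
```

On real data, any topmost synset that does not correspond to a top ontology synset would be a
candidate for manual inspection.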

3.2. Supporting Expand and Merge Model used in WordNet Building 

VisDic has a special function in the tree view that copies the whole structure of the subtree
rooted at a given node to another WordNet. Lexicographers using the Expand model for WordNet
building will appreciate this feature. All synsets that do not exist in the target database are
copied to it. Their literals can then be translated while their relations are kept as defined.
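
The subtree copy described above might look roughly as follows. This is only a sketch under
stated assumptions: a WordNet is modelled as a dictionary from synset keys to records, and the
`Synset` class, its fields and `copy_subtree` are hypothetical names, not the VisDic data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    key: str
    literals: List[str]
    hyponyms: List[str] = field(default_factory=list)  # keys of child synsets


def copy_subtree(source: Dict[str, Synset], target: Dict[str, Synset],
                 root_key: str) -> List[str]:
    """Copy every synset below root_key that the target lacks; return the copied keys."""
    copied: List[str] = []
    seen = set()
    stack = [root_key]
    while stack:
        key = stack.pop()
        if key in seen:  # a synset may be reachable along several paths
            continue
        seen.add(key)
        node = source[key]
        if key not in target:
            # Literals are copied untranslated; relations are kept as defined.
            target[key] = Synset(key, list(node.literals), list(node.hyponyms))
            copied.append(key)
        stack.extend(node.hyponyms)
    return copied


english = {
    "vehicle": Synset("vehicle", ["vehicle"], ["car"]),
    "car": Synset("car", ["car", "automobile"], ["cab"]),
    "cab": Synset("cab", ["cab", "taxi"]),
}
czech = {"vehicle": Synset("vehicle", ["vozidlo"], ["car"])}

copied_keys = copy_subtree(english, czech, "vehicle")
print(copied_keys)
```

Synsets that already exist in the target are left untouched, so earlier translations are not
overwritten.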

VisDic also makes it possible to assign the key value of a synset in another WordNet database to
the current synset. The two synsets then become equivalent. This feature is helpful mainly when a
WordNet is built using the Merge model. Developers of a new WordNet can build selected synsets,
join them with ILRs and then simply link them to another WordNet. This process is going to be used
frequently in the Balkanet project (September 2001, see [4]).

3.3. Importing and Exporting Files 

VisDic is able to import and export any XML structured file. The import is performed automatically
during dictionary loading: the XML file is converted to an internal binary representation, which
is not human-readable but allows fast searching and editing of entries.

Another VisDic function exports the whole WordNet, or a specified subtree of any kind, to an XML
file. This file can be modified by another tool or application and later imported back into
VisDic. This behavior is very similar to the Polaris importing and exporting mechanism. However,
every synset is stored on one line of the text file. This makes the format suitable for gathering
statistics about the WordNet. Especially on Linux systems, many statistics can be obtained using
simple scripts or even a single command (such as grep).
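
Because each synset occupies one line, line-oriented processing is enough for simple statistics.
The sketch below counts synsets per part of speech. The element names (`SYNSET`, `ID`, `POS`,
`LITERAL`) are assumptions made for illustration only; the actual VisDic export format is not
specified here.

```python
import re
from collections import Counter
from typing import Iterable

POS_RE = re.compile(r"<POS>(.*?)</POS>")  # hypothetical tag name


def pos_counts(lines: Iterable[str]) -> Counter:
    """Count synsets per part of speech, assuming one synset per line."""
    counts: Counter = Counter()
    for line in lines:
        match = POS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "<SYNSET><ID>1</ID><POS>n</POS><LITERAL>car</LITERAL></SYNSET>",
    "<SYNSET><ID>2</ID><POS>v</POS><LITERAL>drive</LITERAL></SYNSET>",
    "<SYNSET><ID>3</ID><POS>n</POS><LITERAL>cab</LITERAL></SYNSET>",
]

print(dict(pos_counts(sample)))
```

With a real export, the list would be replaced by an open file object, which Python also iterates
line by line.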

4. Conclusions 

VisDic was developed mainly for browsing and editing WordNet databases. However, from the
beginning it has been designed to view and edit any other lexical data in XML format; in this
respect it differs essentially from other WordNet tools. Moreover, VisDic works with the XML
format, which is regarded as a standard and is readable by many other applications. However, some
functions are still missing, mainly those related to multi-user processing. We intend to integrate
the VisDic tool into a WordNet Management System providing the necessary services for the Balkanet
project, which is now under way (see [4]).

References 

[1] Miller G.A., Beckwith R., Fellbaum Ch., Gross D., Miller K. (1993) Introduction to WordNet: An
On-line Lexical Database, Princeton University.

[2] Vossen, P. (August 1999) Final Report on EuroWordNet, CD-ROM, Amsterdam University.

[3] Polaris User's Guide, Louw M., Lernout & Hauspie, Antwerp, Belgium, 1998, p. 59-82.

[4] Balkanet Project, No. IST-2000-29388, led by D. Christodoulakis, University of Patras, DBLAB.

Tomas Pavelek, NLP Laboratory, Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, Czech
Republic, 634 00, xpavelek@fi.muni.cz

Karel Pala, NLP Laboratory, Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, Czech Republic,
634 00, pala@fi.muni.cz





WordNet Web Navigation Interface: 

A Fast Interface to Navigate EuroWordNet Hierarchies 

Eneko Agirre 
Olatz Ansa 
Xabier Arregi 
Kike Fernandez 


Abstract 

This paper introduces WWNI, a new web interface for multilingual WordNets. The main features of
this interface are the following: all items shown are clickable, multilingual information can be
shown if desired, and the user can navigate across the hypernymy and meronymy hierarchies in a
straightforward way. The multilingual WordNet database is implemented in mSql, and the CGI scripts
are written in Perl; they produce dynamic HTML pages based on the W3C DOM model and JavaScript.
Because of the DOM model, the interface requires Netscape 6 or Internet Explorer 5, or later
versions. As far as we know, it is the first web interface that allows for fast hierarchy
navigation. It is accessible at the following URL: http://ixa.si.ehu.es/tresnak/wwni/index.html.

INTRODUCTION 

Our research team has been working with WordNet for many years, and has lately been involved in
the construction of the Basque WordNet. The Basque WordNet is being constructed using an interface
developed at UPC (Benitez et al. 1998), but the standard WordNet tcl/tk interface is also used to
consult the English WordNet. Linguists have often complained that these interfaces do not allow
for easy navigation across the relations and hierarchies that involve words, word senses and
synsets. For instance, if they were working on a word and wanted to check the meaning of its
synonyms, they had to type each of the synonyms in turn. Even worse, the only way to view parts of
the hypernymy hierarchies was either to ask for the whole tree or to ask for only the direct
hyponyms of the root. The former can be very slow and produces huge trees that are hard to read.
In the latter case, a user who wants to go one level deeper needs to type all the hyponyms in
turn.

In view of these limitations we started to look for an interface that had the following features: 

• Web-accessible, i.e. the actual multilingual WordNet database was in a remote machine 

• Easy to navigate: 

o Hierarchies shown as trees that were expanded on demand 
o All items were clickable 

• Fast and nice look and feel. 

• Easy and intuitive to use. For instance, the tree could look like directory trees. 

• Not tied to any specific web browser

Review of Other Interfaces 

We analysed the interfaces (other than the standard tcl/tk and the UPC ones) listed on
WordNet’s official web page 37 . We found that all but two of them do not allow the user to click
on the data (e.g. words, synsets) or navigate in the hierarchies:

37 The list of interfaces checked is the following: Kyoto One Touch, Oxford SQL, Visual Navigation Tree, TreeWalk,
Greaves SQL, WN Python, Enst interactive, Vancouver One Touch, and the UPC EuroWordNet interface. Check the
official WordNet site for more details.



• “The visual navigation tree” does allow for both things, but is rather slow, and the way it
shows the trees is very awkward.

• “The WordNet treewalk” allows for easier navigation, but it is far from intuitive (the relations
are represented by symbols) and it is a standalone program (not web based).



Figure 1: The visual navigation interface. 


Design 

Figure 2 shows the main window of the WWNI interface. It is divided into three frames: the control
frame, the word sense frame and the hierarchy tree frame.

The upper part is the control frame. The target word is entered here, alongside the desired
WordNet (English, Basque or other). The multilingual check box selects whether information is
shown for one language or for all the languages in the database.

After pressing the search button, information about the senses of the target word is shown in the
word sense frame (left), including the gloss and the words in the synset. If the multilingual box
is checked, the information for all equivalents in the languages in the database is also shown in
the word sense frame. Each language uses a different color. For each of the word senses we can
choose info to show the complete list of relations for the synset in a separate window (see Figure
3), or start a hyponymy/meronymy tree in the third frame.

At the beginning the hierarchy tree is rooted in the word sense chosen, and only its direct
hyponyms, if any, are shown. Each concept in the hierarchy can be expanded or collapsed by
clicking the + / - box, i.e. we can expand or cut the subtree below any synset. Only three
variants are shown for each synset, but clicking ... shows them all. The bulb represents one
synset, and clicking on it displays a menu with the following choices: synset info for opening a
window with all the information on the synset (cf. Figure 3), become root for deleting the whole
tree except this synset and its direct hyponyms, open / close for opening or closing the hyponymy
tree (same functionality as clicking the + / - icons), and finally hyper/holo for showing the
hypernym(s) / holonym(s).


The hierarchies in WordNet are lattices rather than strict trees. As there are no straightforward
algorithms to control lattices in the way we can control trees, we decided to use trees. This
means that, when expanding a tree upwards and downwards, the same subtree could be displayed in
different places. In order to control the display properly, hyponyms of the root synset can only
be expanded downwards, and hypernyms of the root can only be expanded upwards. The root is the
only synset that allows for both downward and upward expansion. The become root option allows for
quickly setting a different synset as root.
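
The expansion rule just described can be sketched as follows. This is an illustration of the rule
only, not the WWNI JavaScript code: the class and method names are hypothetical, and the lattice
is assumed to be given as a mapping from each synset to its direct hyponyms.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    synset: str
    role: str  # "root", "hyponym" or "hypernym"
    below: List["Node"] = field(default_factory=list)
    above: List["Node"] = field(default_factory=list)


class HierarchyView:
    """Displays a lattice as a tree that grows on demand from a root synset."""

    def __init__(self, hyponyms: Dict[str, List[str]], root: str):
        self.hyponyms = hyponyms
        self.hypernyms: Dict[str, List[str]] = {}
        for parent, children in hyponyms.items():
            for child in children:
                self.hypernyms.setdefault(child, []).append(parent)
        self.root = Node(root, "root")

    def expand_down(self, node: Node) -> List[Node]:
        if node.role == "hypernym":
            raise ValueError("hypernyms of the root can only be expanded upwards")
        node.below = [Node(s, "hyponym") for s in self.hyponyms.get(node.synset, [])]
        return node.below

    def expand_up(self, node: Node) -> List[Node]:
        if node.role == "hyponym":
            raise ValueError("hyponyms of the root can only be expanded downwards")
        node.above = [Node(s, "hypernym") for s in self.hypernyms.get(node.synset, [])]
        return node.above

    def become_root(self, synset: str) -> Node:
        """Drop the current tree; keep only the new root and its direct hyponyms."""
        self.root = Node(synset, "root")
        self.expand_down(self.root)
        return self.root


# Toy lattice: "cab" has two hypernyms, so the structure is not a strict tree.
toy = {"vehicle": ["car"], "car": ["cab"], "public transport": ["cab"]}
view = HierarchyView(toy, "car")
print([n.synset for n in view.expand_down(view.root)])
print([n.synset for n in view.expand_up(view.root)])
```

Restricting each non-root node to one direction is what prevents the same subtree from being
reached twice within one display.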



[Figure 2 screenshot. Recoverable labels: WordNet 1.5; hierarchy tree frame; Hypernyms; Hyponyms.
The tree shown runs from vehicle through automotive vehicle to automobile / car, with hyponyms
such as phaeton / tourer, roadster / runabout, hearse, hatchback and hardtop.]


Figure 2: Main window of WWNI. 






















Figure 3: Information window 


All relations and available information for a given synset can be checked using the info option,
either in the word sense frame or in the menu of the hierarchy tree frame. An example is shown in
Figure 3. As in the word sense frame, whenever the multilingual box is checked, the information
for all languages is shown, using different colors to differentiate the languages.

All words (variants) and synsets in any frame or window are clickable. For words, the word sense
frame is updated with the corresponding word sense information. For synsets, the option in the
menu (hierarchy tree frame) or the link (word sense frame) determines the action to be taken.

Architecture and implementation details 

On the server side, the multilingual WordNet database is implemented in mSql, and the CGI scripts
in Perl. The Perl scripts access the database using the DBI and MSql modules. The Internet server
used is Apache. Each time the web browser requests a page, the respective Perl CGI script is run.
For all frames except the hierarchy tree frame, the Perl CGI script returns static HTML code.

For the hierarchy tree frame, there is a set of JavaScript functions that are loaded with the main
page. These functions allow for the manipulation and display of the tree, using the W3C DOM model.
Every time the user changes the tree, a request to the corresponding CGI script on the server is
made. The CGI script, in this case, returns an empty frame with JavaScript function calls that
include the information needed to update the tree. When the function calls are executed on the
client side, the tree is changed accordingly. The effect is that, whenever the user changes the
tree, the JavaScript functions are executed with the requested information (red lines in Figure
4).
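
The "empty frame with function calls" technique can be sketched as below. The original scripts
are written in Perl; Python is used here only for illustration. The JavaScript function name
`parent.addNode` and the helper names are hypothetical, since the actual WWNI function names are
not given in the text.

```python
import json
from typing import List


def js_string(value: str) -> str:
    """Encode a Python string as a JavaScript string literal safe inside <script>."""
    return json.dumps(value).replace("</", "<\\/")


def tree_update_page(parent_id: str, hyponyms: List[str]) -> str:
    """Build a CGI response: an otherwise empty page whose script updates the tree."""
    calls = "\n".join(
        f"parent.addNode({js_string(parent_id)}, {js_string(child)});"
        for child in hyponyms
    )
    return (
        "Content-Type: text/html\n\n"
        "<html><body><script>\n"
        f"{calls}\n"
        "</script></body></html>"
    )


page = tree_update_page("car", ["cab", "sedan"])
print(page)
```

The page itself displays nothing; loading it into a hidden or empty frame simply causes the
browser to run the calls against the functions already loaded with the main page.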

Because a browser with DOM support is needed, the interface requires Netscape 6 or Internet
Explorer 5, or later versions.
















Figure 4: Architecture of the Interface (client) and the Server 


Conclusions 

The WWNI web interface for multilingual WordNets is a user-friendly interface whose special
feature is the ease of navigation across the hypernymy and holonymy hierarchies. It also allows
the user to easily access any word or synset, as all items in the interface are clickable. It is
accessible at the following URL: http://ixa.si.ehu.es/tresnak/wwni/index.html.

The interface is built over a relational database representation of the EuroWordNet design, but
could easily be ported to other WordNet implementations that provide suitable APIs for Perl.

References 

Benitez, L., Escudero, G., Farreras, J. & Rigau, G. (1998) WWI: A Multilingual WordNet Interface 
using the Web. Technical Report LSI-98-6-T. LSI Department, Universitat Politecnica de Catalunya. 

Eneko Agirre, IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain, 
eneko@si.ehu.es 

Olatz Ansa, IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain, 

Xabier Arregi, IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain, 

Kike Fernandez IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain, 












Storing and Retrieving WordNet Database (and other Structured 
Dictionaries) in XML Lexical Database Management System 

Pavel Smrz

Abstract 

This paper deals with the efficient storage and retrieval of various kinds of lexical information
in a specialised lexical database management system. Relevant aspects of the XML format and many
related technologies are surveyed first. The second section describes the motivation for and the
internals of the designed and implemented client/server system. The last section presents one
specific sub-module of our system that will integrate lexical rules for regular polysemy and
derivational morphology paradigms.

1. The XML Family 

XML (Extensible Markup Language) (Bray et al. 2000) is an international standard for the
representation and interchange of data. It is a powerful instrument enabling general markup of all
forms of structure, mutual references and multilevel structure hierarchies.

The XML language, oriented primarily to World Wide Web applications, is a simplified dialect of
SGML (Standard Generalised Markup Language). Consequently, it is theoretically weaker than SGML in
some aspects. However, thanks to dozens of connected technologies enabling document
transformations, definitions of constraints, structure validation, and pointers within one
document as well as mutual references between documents (see below), XML is an appropriate tool
for keeping up with the extreme speed of progress in the field of information technology.

Users can piggyback on many existing mechanisms for data access and manipulation when working with
lexical databases in the XML format. We will speak about a family of XML standards. In its base
form XML is a markup language, and so it allows the identification of text elements, hierarchical
structure and references. The structure of XML encoded documents is described by a DTD (Document
Type Definition), which already occurs in the SGML standard. A DTD defines generalised structure
rules and determines what is permitted in each particular document encoding.

The merits of document validation offered by DTD are extended by the XML Schema definition
language (Thompson et al. 2001, Biron and Malhotra 2001). It provides a way to restrict and
document the meaning, usage and relations of particular parts of XML documents. Default values of
attributes and elements can be specified, for example. From the conceptual point of view, an XML
Schema definition can be considered an abstract data model of the described document class (Ide
2000).

Other members of the “XML family” are the stylesheet languages XSL (Extensible Stylesheet
Language) (Adler et al. 2000) and XSLT (Extensible Stylesheet Language for Transformations) (Clark
1999, Clark 2001). A stylesheet determines what action will be performed if a given condition is
fulfilled. XSLT processors work with XML documents represented by tree structures and transform
them to an arbitrary other format by means of information selections, re-arrangements and
additions. The XSLT language supports the selection of the element content or its parts from one
or more XML documents and the transformation of content as well as names of elements.

A query mechanism is absolutely indispensable for accessing the content of large collections of
XML documents. Several XML query languages have been proposed to the World Wide Web Consortium
(W3C) in the last few years. The most popular is probably XQuery (XML Query Language) (Chamberlin
et al. 2001), which is able to specify complex queries in a human-readable form (an alternative
representation appropriate for automatic processing, XQueryX, complies with the XML syntax).





Several definition standards for references between XML documents [...] XLink (DeRose 2001a)
accepts the interconnection between two or more [...] predicates manipulating character strings,
so that particular [...] addressing [...] document parts.

[...] “Great! Maybe Later!” in that time. [...] versions of the most common World Wide Web
browsers support the format. Meanwhile, [...] domain of lexical data representation and
interchange.


2. XML Lexical Database Management System


[...] to store and retrieve lexical data in XML efficiently. The following text offers an answer.

There are several ways to store lexical data in XML. Relational database management systems
(RDBMS) play a crucial role in the storage of richly [...] is their maturity and almost three
decades of experience with them. However, standard [...] databases are not particularly suitable
for manipulating [...] to the sets of relational tables.

The main discrepancies between the world of relational databases and the “XML family” are
summarised in (Chamberlin et al. 2001):

[...] maintain, though not understand).





expensive database technology. The great popularity of WordNet has been brought about, among other
things, by the wide, free and easy access to the whole database.

To survey our requirements, in addition to the essential need for very efficient storage and
retrieval we can state:

Non-commercial license (at least the essential parts under an open-source license, e.g. the GNU
General Public License);

Open system, able to integrate with other applications and open to future development;

Multiplatform; especially clients should be provided in OS-independent languages, e.g. in Java.

All these considerations led us to the decision to implement a system specialised in the efficient
storage and retrieval of lexical databases. The system is called MAXXL and originated as the
practical output of a Master's thesis at the Faculty of Informatics, Masaryk University in Brno,
Czech Republic (Karasek 2000).

The system takes full advantage of the client/server architecture. The server part manages data
storage and retrieval; clients mediate the communication with users, query definition and the
presentation of answers.

Several programming tools have been employed in the implementation of MAXXL. All the code is
written in C++ with interfaces to other languages (Perl, Python, Java). Special attention has been
paid to the object-oriented design: abstract classes, inheritance and polymorphism play a crucial
role in the unified access to all the objects. The technique of data streams then allows an
abstraction of the source; an operation is the same whether the data are read from a text file, a
string, or a socket. The use of the STL (Standard Template Library), which contains an efficient
implementation of basic data structures, has also proved to be a good decision.

MAXXL is designed as an open system; new modules can easily be added. A basic characteristic of
the system is also its total independence of any particular XML structure. The processing is based
on a given DTD, and indexes for efficient retrieval are generated from additional information
about individual element types. The most important piece of information is the identification of
the primary key element.

MAXXL defines three core classes that are assigned to elements of the stored XML (Karasek 2000):

CAhead serves as the identification of the primary key of a given lexical database entry (e.g.
ILI, the Inter-Lingual Index, in EuroWordNet databases). It has to be unique over the whole set of
entries because the number assigned to this key by the lexicon is used as an identifier for the
binary storage and the query evaluation.

CAplain is assigned to the simplest elements, the leaves of the XML tree, so it cannot include
other elements. In XML, the type #PCDATA corresponds to CAplain.

CAwrap is the essential class for capturing the tree structure, as it allows another CAwrap class
to be its descendant. The internal representation is based on a list of pointers.

The assignment of the above-mentioned classes to elements of a source XML is given by a definition
file. MAXXL provides a simple tool that tries to guess the types of all elements from a given
sample of data. Users can modify the output of this tool to achieve an exact match with the
intended structure. All these steps can be replaced by a direct extraction of the required
information from the XML Schema definition if it is provided for the source data.
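
A guessing tool of this kind could work roughly as in the following sketch. It is not the MAXXL
tool itself: the heuristic, the function name and the element names in the sample are
assumptions, and the primary key element is supplied by the user rather than guessed.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# A stronger guess overrides a weaker one seen earlier for the same tag.
RANK = {"CAplain": 0, "CAwrap": 1, "CAhead": 2}


def guess_classes(sample_xml: str, key_tag: str) -> Dict[str, str]:
    """Guess one of CAhead / CAwrap / CAplain for every element type in the sample."""
    classes: Dict[str, str] = {}
    for elem in ET.fromstring(sample_xml).iter():
        if elem.tag == key_tag:
            guess = "CAhead"   # primary key element, named by the user
        elif len(elem) > 0:
            guess = "CAwrap"   # has child elements
        else:
            guess = "CAplain"  # leaf holding character data only
        if RANK[guess] > RANK.get(classes.get(elem.tag), -1):
            classes[elem.tag] = guess
    return classes


SAMPLE = (
    "<WN>"
    "<SYNSET><ILI>00001</ILI><POS>n</POS>"
    "<SYNONYM><LITERAL>car</LITERAL><LITERAL>automobile</LITERAL></SYNONYM></SYNSET>"
    "<SYNSET><ILI>00002</ILI><POS>n</POS><SYNONYM/></SYNSET>"
    "</WN>"
)

guessed = guess_classes(SAMPLE, key_tag="ILI")
print(guessed)
```

The second SYNONYM element in the sample is empty, which is why a single instance is not enough to
decide: the sketch keeps the CAwrap guess once any instance has been seen with children. A small
sample can still mislead the heuristic, which is consistent with the need for manual correction
mentioned above.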


Lexical database in MAXXL is a set of XML documents. These documents can use various character 
encodings so that the problems associated with the usage of all different alphabets can be solved. 





MAXXL accepts data represented in the UNICODE format, UTF-8 coding (128 characters encoded in
1 byte - ASCII; 1920 characters in 2 bytes - all Czech characters, Greek, Hebrew, ....;
characters in 3 bytes - Chinese, Japanese; 4, 5 and 6 byte codes still unassigned). Consequently,
it is able to process XML data from different languages at the same time.
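
The byte lengths quoted above can be checked directly. The sketch below is only an illustration
in Python (MAXXL itself is written in C++); the helper name is hypothetical.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character each: ASCII, Czech, Greek, Hebrew, Chinese.
for ch in ["a", "č", "λ", "ש", "中"]:
    print(ch, utf8_length(ch))

# Count the code points that need exactly one and exactly two bytes.
one_byte = sum(1 for cp in range(0x80) if utf8_length(chr(cp)) == 1)
two_byte = sum(1 for cp in range(0x80, 0x800) if utf8_length(chr(cp)) == 2)
print(one_byte, two_byte)
```

The counts agree with the 128 and 1920 characters given in the text. Note that the current
definition of UTF-8 (RFC 3629) restricts sequences to at most four bytes, so the 5 and 6 byte
forms mentioned above belong to the older definition.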

There are two ways of querying MAXXL. The first one takes advantage of XML as its basic model.
Access to the data is provided by the standard mechanism of XSL transformations: XSLT is used as a
query language. The result of a query is then usually a set of XML nodes that can be wrapped in a
root element to create a well-formed XML document. This method suffices only for small lexical
databases. The fast, compact and portable XML toolkit Sablotron (Kaiser 2001) was integrated as
the XSL processor in our experiments.
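
Wrapping a node set in a root element can be illustrated as follows. Note the substitution: the
Python standard library has no XSLT processor, so the sketch selects nodes with the limited path
syntax of `xml.etree.ElementTree` instead of XSLT. The element names and the `RESULT` root tag are
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DB = (
    "<WN>"
    "<SYNSET><ID>1</ID><POS>n</POS></SYNSET>"
    "<SYNSET><ID>2</ID><POS>v</POS></SYNSET>"
    "<SYNSET><ID>3</ID><POS>v</POS></SYNSET>"
    "</WN>"
)


def query_to_document(db_xml: str, path: str, root_tag: str = "RESULT") -> str:
    """Select nodes with a path expression and wrap them in a single root element."""
    database = ET.fromstring(db_xml)
    result = ET.Element(root_tag)
    result.extend(database.findall(path))
    return ET.tostring(result, encoding="unicode")


# All synsets whose POS child has the text "v".
document = query_to_document(DB, ".//SYNSET[POS='v']")
print(document)
```

Without the wrapping step the result would be a sequence of sibling elements, which is not a
well-formed XML document on its own.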

The more robust approach to querying is the MAXXL native query mechanism. The system defines its
own query language, which is specially tailored to reflect the needs of lexical data [...]. The
result of a query takes the form of a sequence of XML elements or a sequence of simple [...].

Operators for exact match, prefix searching and general substring localisation are provided. The
very efficient Karp-Rabin algorithm is employed for these tasks.
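
For readers unfamiliar with it, the Karp-Rabin idea is to compare a rolling hash of each text
window with the hash of the pattern and to verify only the windows whose hashes agree. The
following is a textbook sketch in Python, not the MAXXL implementation; the base and modulus are
arbitrary choices.

```python
from typing import List


def karp_rabin(text: str, pattern: str, base: int = 256,
               mod: int = 1_000_000_007) -> List[int]:
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character of a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Equal hashes may be a collision, so the window is verified explicitly.
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return hits


print(karp_rabin("abracadabra", "abra"))
```

An exact-match or prefix operator can be seen as a special case in which only position 0 (and, for
exact match, equal length) is accepted.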

MAXXL takes into account the need for morphological expansion of some queries in various
languages. The system offers a mechanism for the integration of morphological analysers. The
mechanism records which of the following three operations are implemented by the given analyser:

expand - generates all word forms and adds them to a list;

stem - replaces the given word by its base form;

patt - replaces a given word by its pattern, which can be defined arbitrarily.
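
One way such an integration could be organised is sketched below. The `Analyser` wrapper, the
fallback order in `expand_query` and the toy word forms are all assumptions made for
illustration; the actual MAXXL interface is not described in the text.

```python
from typing import Callable, Dict, List, Optional


class Analyser:
    """Wraps whichever of the three operations a concrete analyser implements."""

    def __init__(self,
                 expand: Optional[Callable[[str], List[str]]] = None,
                 stem: Optional[Callable[[str], str]] = None,
                 patt: Optional[Callable[[str], str]] = None):
        candidates = [("expand", expand), ("stem", stem), ("patt", patt)]
        self.ops: Dict[str, Callable] = {
            name: fn for name, fn in candidates if fn is not None
        }

    def supported(self) -> List[str]:
        return sorted(self.ops)


def expand_query(word: str, analyser: Analyser) -> List[str]:
    """Return the terms to search for, using the best operation available."""
    if "expand" in analyser.ops:
        return analyser.ops["expand"](word)
    if "stem" in analyser.ops:
        return [analyser.ops["stem"](word)]
    return [word]


# Toy data, not a real Czech lexicon.
FORMS = {"dům": ["dům", "domu", "domem"]}
BASE = {form: lemma for lemma, forms in FORMS.items() for form in forms}

full = Analyser(expand=lambda w: FORMS.get(BASE.get(w, w), [w]),
                stem=lambda w: BASE.get(w, w))
stem_only = Analyser(stem=lambda w: BASE.get(w, w))

print(full.supported(), expand_query("domu", full))
print(stem_only.supported(), expand_query("domu", stem_only))
```

An analyser that implements none of the operations leaves the query word unchanged.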

The system also provides a direct connection to the corpus management system Manatee designed and
implemented at our faculty (Rychly 2000). It reflects the need for authentic language [...]
linguistics. This tool enables a KWIC (keyword in context) classification according to the
semantic similarity of words in the neighbourhood of the searched expression.

MAXXL has been implemented and is available under the GNU General Public License. It will be used
extensively in the work on the Czech part of the Balkanet project, in the process of [...]
machine-readable version of the Dictionary of the Literary Czech Language (SSJC) and in several
other tasks.

3. FUTURE DIRECTIONS - REGULAR POLYSEMY AND DERIVATIONAL MORPHOLOGY PARADIGMS 
IN MAXXL 


A great deal of regular polysemy can be interpreted as metonymic extension of meaning. Cruse
(2000) presents three aspects of the motivation for using metonymy: economy, ease of access to the
referent and highlighting of the associative relation. Senses in a metonymy group are tied by
systematic connections realized as so-called subtype coercion (Pustejovsky 1995), or lexical
(inference) rules (Leech 1974, Ostler and Atkins 1992).


Currently, we are working on a submodule of MAXXL that will integrate lexical rules with the core
lexicon and create a base for a Czech lexical database organised on principles similar to those of
EuroWordNet. The main linguistic phenomena that should be handled by lexical rules are sense
extensions, namely regular metonymic and some productive metaphoric processes. The following
(non-exhaustive) list shows the most frequent ones:


tree species for the type of wood
tree species for fruit
plant for flower
plant for food
animal for fur or hide
animal for food
organisation for building
material for product
container for quantity (volume)
container for contained
possessor for possessed
represented for representative
whole for part
part for whole
place for institution

composer for music by same
food for person ordering same
collection for facility
proper names for companies for their product
personal names of people for properties
plural of mass nouns for portions or kinds

These relations are defined in a strictly declarative way in the form of an XML data file. A big
advantage of the DTD specification is that it handles all the metonymic processes in basically the
same way as another big group of semantic relations, the derivational morphology processes.
Differences are, of course, preserved in the subtype of the link between word senses. In the
EuroWordNet database, derivational morphology paradigms can be encoded as ILRs (Internal Language
Relations) (Klimova and Pala 2000). Productive patterns in our system are especially:

deverbatives (počítat/count - počítání/counting - počítaný/counted - počítán/is counted -
počítající/is counting)

diminutives (dům/house - domek/small house - domeček/tiny house, the house I like)

aspectual relations of verbs (říci/tell - říkat/tell all the time)

iterative relations of verbs together with “degrees” (chodit/walk - chodívat/walk regularly -
chodívávat/used to walk)

relations between an animate noun and the derived possessive adjective (otec/father -
otcův/father's)

feminine from masculine nouns (soudce/judge - soudkyně/female judge)

activity - agent - involved agent (učit/teach - učitel/teacher)

activity - instrument - involved instrument (sedat/sit - sedadlo/seat)

Note that compound forms comprise only a small portion of the morphological processes in Czech,
but there is in principle no reason why compounding could not be treated in the same way in our
system.

The designed system will benefit from a hybrid approach combining the advantages of different
approaches that deal with two orthogonal parameters, rule application time and rule trigger,
considered in (Onyshkevych 1999). Since one of the outputs of this research should be a lexical
database, we have to invoke all the lexical rules at acquisition time and review the resulting
senses manually. In order to escape the trap of overgeneration when rules are triggered by
constraints, we assume the full enumeration of rule-application possibilities. Of course, this
implies a lot of tedious work and the need to revisit old lexical entries when a new lexical rule
is added (Onyshkevych 1999). However, we will try to generalise the rule-triggering criteria by
means of automatic constraint acquisition.

Conclusions

The topic discussed in this paper is an XML lexical database management system and its specific
submodule dealing with lexical rules. We come to the conclusion that the XML format is definitely
the appropriate form for data representation and interchange. The implementation of XML export and
import features is probably the best way to share linguistic data. The presented system goes one
step further and processes data directly in XML. This approach offers an open architecture, easy
integration with other applications and future extensions. It also makes it possible to combine
the advantages of wordnet-like networks with a system that will integrate lexical rules for
regular polysemy and derivational morphology paradigms. The system will be used intensively within
the Balkanet project, at least in its Czech part.

REFERENCES

Adler S. et al. (2000) Extensible Stylesheet Language (XSL). Version 1.0. W3C Proposed
Recommendation. http://www.w3.org/TR/xsl/.



Biron P. and Malhotra A. (2001) XML Schema Part 2: Datatypes. W3C Recommendation. 
http://www.w3.org/TR/xmlschema-2/. 

Bray T. et al. (2000) Extensible Markup Language (XML) 1.0 (Second Edition). W3C 
Recommendation. http://www.w3.org/TR/1998/REC-xml. 

Chamberlin D. et al. (2001) XQuery 1.0: An XML Query Language. W3C Working Draft.
http://www.w3.org/TR/xquery/.

Clark J. (1999) XSL Transformations (XSLT). Version 1.0. W3C Recommendation.
http://www.w3.org/TR/xslt/.

Clark J. (2001) XSL Transformations (XSLT). Version 1.1. W3C Working Draft.
http://www.w3.org/TR/xslt11/.

Clark J. and DeRose S. (1999) XML Path Language (XPath). Version 1.0. W3C Recommendation.
http://www.w3.org/TR/xpath/.

Cruse D. A. (2000) Meaning in Language - An Introduction to Semantics and Pragmatics. Oxford 
University Press. 

DeRose S. et al. (2001a) XML Linking Language (XLink). Version 1.0. W3C Recommendation.
http://www.w3.org/TR/xlink/.

DeRose S. et al. (2001b) XML Pointer Language (XPointer). Version 1.0. W3C Last Call Working
Draft. http://www.w3.org/TR/xptr/.

Ide N. (2000) The XML Framework and Its Implications for the Development of Natural Language 
Processing Tools. In: Proceedings of the COLING Workshop on Using Toolsets and Architectures to 
Build NLP Systems. 

Kaiser T. (2001) Sablotron. http://www.gingerall.cz/charlie/ga/xml/p_sab.xml 

Karasek L. (2000) A System for the Development and Presentation of Mono- and Multilingual 
Dictionaries. Master thesis, Faculty of Informatics, Masaryk University, Brno (in Czech). 

Klimova J. and Pala K. (2000) Application of WordNet ILR in Czech Word-formation. In 
“Proceedings of LREC’2000 (The Second International Conference on Language Resources and 
Evaluation)”, Athens, Greece. 

Leech G. A. (1974) Semantics. Harmondsworth: Penguin. 

Onyshkevych B. A. (1999) Categorization of Types and Application of Lexical Rules. In: “Breadth 
and Depth of Semantic Lexicons”, E. Viegas, ed., Kluwer Academic Publishers. 

Ostler N. and Atkins B. T. S. (1992) Predictable Meaning Shift: Some Linguistic Properties of Lexical 
Implication Rules. In “Lexical Semantics and Knowledge Representation”, J. Pustejovsky, ed., 
Heidelberg: Springer Verlag. 

Pustejovsky J. (1995) The Generative Lexicon. MIT Press. 

Rychly P. (2000) Corpus Managers and Their Efficient Implementation. PhD Thesis, Faculty of 
Informatics, Masaryk University, Brno (in Czech). 

Thompson H. S. et al. (2001) XML Schema Part 1: Structures. W3C Recommendation.
http://www.w3.org/TR/xmlschema-1/.

Pavel Smrz, Faculty of Informatics, Masaryk University, Botanicka 68a, 60200 Brno, Czech Republic,
smrz@fi.muni.cz



An Architecture for Engineering Sublanguage WordNets 

Kalyan Moy Gupta 
David W. Aha 
Elaine Marsh 
Tucker Money 

Abstract 

We describe a software architecture for interactively acquiring and maintaining sublanguage
WordNets. The architecture builds upon the WordNet semantic structure and includes integrated
capabilities for concept element discovery, concept identification, and concept maintenance. We
describe the completed components of our ongoing implementation by applying them to Navy Lessons
Learned documents. Our preliminary observations indicate that there is very little overlap between
concepts discovered from sublanguage documents and WordNet.

Introduction 

One of our immediate goals is to enhance decision-making in the U.S. Navy by capturing, organizing, 
and leveraging their operational experience recorded in short semi-structured documents called 
lessons. Central to this goal is the capability to retrieve short documents and discover patterns in them 
at an abstract level. This capability requires semantic tagging and subsequent restructuring of 
documents (Gupta et al., 2001). Semantic tagging typically requires lexical resources that include a 
rich set of concepts and relations such as those included in WordNet (Fellbaum, 1998). However, 
using a general-purpose lexical resource such as WordNet for a narrow sub-domain is limited by 
insufficient coverage (i.e., missing wordforms and senses), its fine-grained nature, and the 
unnecessary computational overhead due to fragments that are irrelevant to the sublanguage (Zernik,
1991; Evans and Kilgarriff, 1995).

Natural language processing (NLP) applications such as information extraction and document 
restructuring rarely deal with general language. Consequently, sublanguage specific lexicons need to 
be engineered to meet the specific application needs (Evans & Kilgarriff, 1995). Sublanguage lexicon 
engineering approaches range from reusing and fine tuning general-purpose lexicons (e.g., Vossen, 
2001, Rada & Moldovan, 2001) to using them as resources for creating new ones (Yamaguchi, 2001). 
Although successful, these approaches represent a small subset of methodologies available for lexical 
engineering. A more general approach capable of accommodating the development and evaluation of 
a wide range of lexical engineering methodologies is needed. To this end, we present an extensible 
architecture that integrates the lexicon engineering tasks of concept element discovery, concept 
identification, and concept maintenance. 

Our motivation in developing a sublanguage WordNet (SubWordNet) engineering architecture is
twofold: (1) to enable rapid development of SubWordNets for NLP applications, and (2) to enable the
development, evaluation, and subsequent extension of our architecture with new methodologies. We 
have implemented the concept element discovery components and concept maintenance components 
of the architecture. We present our preliminary observations obtained by using the architecture to 
develop a SubWordNet for Navy Lessons. 

The remainder of this paper is organized as follows. Section 1 defines the lexicon engineering 
process. Section 2 describes the components of our architecture. We discuss the implementation of 
our architecture and report our preliminary observations of component performances on Navy lesson 
documents in Section 3. The conclusion summarizes the paper and presents directions for future work. 



1. LEXICON ENGINEERING PROCESS 

Lexicon engineering has evolved from completely manual processes to automatically adapting
machine-readable dictionaries and corpus-based development of lexical knowledge bases (Evans & 
Kilgarriff, 1995). Recent efforts for lexicon engineering include extraction of ontologies from text 
(Maedche & Staab, 2001), extending and trimming WordNet from technical documents (Vossen 
2001), and extracting relations from text using WordNet (Yamaguchi, 2001). We generalize these and 
other relevant approaches into an iterative three-step lexicon engineering cycle for developing 
SubWordNets as follows (see Figure 1): 

[Diagram: Sublanguage documents -> Discover Concept Elements -> Words, Phrases & Relations ->
Identify Concepts (using WordNet) -> New Concepts & Relations -> Maintain Concepts]

Figure 1: Iterative SubWordNet Engineering Process 

1. Discover Concept Elements: The goal of this step is to discover concept elements, which include
words, generated multi-word phrases, and potential relationships among these elements that occur in
input sublanguage documents. For example, “Marine Mountain Warfare Training” and “Maritime
Interception Operation Training” would be discovered as multi-word phrases in the Navy Lessons
domain. An unnamed relation between them could be discovered and suggested.
Subsequently, a user could identify the relation as of meronym/holonym type. This step is typically
a combination of shallow language and text processing along with learning, discovery, and extraction
techniques.

2. Identify Concepts: The objective of this step is to identify new concepts and relations from phrases
and relations discovered in Step 1. Concept identification is supported by grouping phrases into
concept nodes and establishing concordance with synsets in WordNet. For example, the concept
“amphibious assault” can be identified as a synset in WordNet related to “amphibious operation” by
a hypernym relation. The new concept nodes and relationships can be used to update the SubWordNet.

3. Maintain Concepts (Update SubWordNet): This step allows controlled insertion, deletion, and
updating of concepts and relations derived from Step 2 in a SubWordNet while maintaining its
integrity. For example, a user may decide to transfer the WordNet concepts “amphibious operation”
and “amphibious assault” to the SubWordNet while excluding their hyponym synsets from WordNet
such as “battle of wake”.

Users can iterate through these steps with as many sublanguage documents as needed to develop 
SubWordNets and to maintain them on an ongoing basis. 

2. SUBWORDNET ENGINEERING ARCHITECTURE 

Our architecture preserves and builds on the semantic structure of WordNet (Fellbaum, 1998). In a
SubWordNet we refer to WordNet synsets as concepts and the relationships among them as relations.
Each concept includes member phrases or wordforms that are synonyms.

The architecture includes components for the three SubWordNet engineering sub-processes. We
modularized this architecture for scalability and extensibility into three layers: (1) the graphical
user interface (GUI) layer, (2) the process layer, and (3) the data layer (see Figure 2). This
architecture is presented in detail in the following sub-sections.

2.1. Concept Discovery Components 

The three layers supporting the concept discovery task are the Concept Discovery Workbench (CDW) 
as the GUI layer, the Concept Discovery Engine (CDE) as the process layer, and the Discovered 
Concepts Database (DCD) as the data layer. 

2.1.1. Concept Discovery Workbench 

The CDW provides a GUI that allows users to select one or more sublanguage documents for
processing, activate the CDE components to process the selected documents, and view and
manipulate discovered concept elements. The CDW also provides summary distributional information
on words and phrases to assist users.

2.1.2. Concept Discovery Engine 

The CDE has been designed to include seven text and language processing components such as the 
Tokenizer to perform sentence segmentation and wordform tokenization, and the POS Tagger to 
assign part of speech to wordforms. Additional components include the Lemmatizer and the Phrase 
Generator. The Lemmatizer is included to derive base wordforms. The goal of Phrase Generator is to 
generate multi-word canonical phrases by using shallow phrase generation rules. This architecture can 
be extended to include any phrase generation technique (e.g., text chunking or syntactic parsing). 
These phrases can be used to create SubWordNet concepts. 

The ability to identify and extract acronyms and abbreviations is crucial for processing technical 
documents such as those we encounter in the Navy Lessons Learned System (NLLS, 2001). The
Acronym Extractor has been included in the CDE to support this ability. The acronyms and their 
acceptable expansions are represented as synonym members of SubWordNet concepts. 

In WordNet, each synset includes a gloss, which is a sentence that clarifies its meaning. One of the 
goals of CDE is to support human users in gloss authoring and sense identification. The Context 
Fragment Collector was included to collect and store sentence fragments as contexts in which the 
concept elements occur. These sentence fragments can be used to support gloss authoring and to 
develop sense disambiguation components for NLP applications. 

Concepts in SubWordNet support the same relations as those in WordNet. CDE provides the ability 
to use techniques for discovering such relations through the Relation Discoverer component. Relation 
Discoverer can perform relation discovery using techniques such as latent semantic indexing
(Deerwester et al., 1990) and collocation statistics (Gerhard et al., 2001), hyponym discovery using
syntactic patterns (Woods, 1997) and lexical patterns (Hearst, 1992), and non-taxonomic relationships 
learning using generalized rule learning (Maedche & Staab, 2000). As explained in Section 2.2.2, 
relations discovered using these techniques could be compared with those in WordNet for assisting 
users to identify appropriate relations and their subsequent inclusion in a SubWordNet. 


[Diagram: GUI layer with the Concept Discovery Workbench, the Concept Identification Workbench, and
the SubWordNet Editor; data layer with the Discovered Concepts Database, the Identified Concepts
Database, and WordNet. Legend: Implemented / Not Yet Implemented]

Figure 2: Architecture for Engineering SubWordNets 


2.1.3. Discovered Concepts Database

The DCD stores all the CDE processing results such as discovered phrases, acronyms, relations, and 
their summary statistics. 

2.2. Concept Identification Components

Concept Identification components include the Concept Identification Workbench (CIW) as the GUI
layer and the Concept Identification Engine (CIE) as the process layer. The database layer accesses 
the Identified Concepts Database (ICD) and the WordNet. 

2.2.1. Concept Identification Workbench 

The CIW has been designed to access the processes included in the CIE. Its users can selectively 
process phrases and their relationships discovered by the CDE or those that are inserted by users and 
establish concordance between them and the WordNet synsets. Users can use this information to 
create and insert the novel concepts and relationships into SubWordNet using the SubWordNet Editor
(See Section 2.3). 

2.2.2. Concept Identification Engine 

The CIE has been designed to support concept identification through two of its components: the 
Phrase Clusterer and the Concordance Processor. The goal of the Phrase Clusterer is to group 
phrases into concept nodes by applying inputs from WordNet synsets and their relationships. 
Concordance Processor can subsequently establish concordance between concept nodes and WordNet 
synsets by matching each node’s phrase members with WordNet synsets. 
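
The matching step can be pictured with a small Python sketch. It is only an illustration, since the CIE had not been implemented at the time of writing: the function name, the toy synset table, and the ranking by number of shared members are assumptions, not the authors' design.

```python
def concordance(node_phrases, synsets):
    """Match a concept node's member phrases against synset lemmas.

    node_phrases: iterable of phrases grouped into one concept node.
    synsets: dict mapping a (hypothetical) synset id to its lemmas.
    Returns (synset id, shared phrases) pairs, best overlap first.
    """
    node = {p.lower() for p in node_phrases}
    matches = []
    for synset_id, lemmas in synsets.items():
        shared = node & {lemma.lower() for lemma in lemmas}
        if shared:
            matches.append((synset_id, sorted(shared)))
    # More shared members first; ties broken by synset id for stability.
    matches.sort(key=lambda m: (-len(m[1]), m[0]))
    return matches


# Toy stand-in for WordNet; the ids and contents are made up.
TOY_SYNSETS = {
    "amphibious_assault.n.01": {"amphibious assault"},
    "amphibious_operation.n.01": {"amphibious operation"},
    "assault.n.01": {"assault", "attack"},
}
```

A node with no match in the table returns an empty list, which would mark it as a candidate for a new SubWordNet concept.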

2.2.3. Identified Concepts Database 

ICD stores the phrase clustering and concordance processing results. Users can retrieve these results 
to update their SubWordNet. 

2.3. SubWordNet Maintenance Components 

The three layers providing SubWordNet Maintenance functions include the SubWordNet Editor (SuE) 
as the GUI layer, the SubWordNet Editor Engine (SEE) as the process layer, and the SubWordNet as 
the data layer.
























2.3.1. SubWordNet Editor 


SuE has been designed to include a complete range of editing functions including creating, updating, 
deleting, searching, and manipulating SubWordNet concepts and their relations (i.e., WordNet synsets 
and their relationships). The concepts and relationships can be manually created by a user or 
automatically created from those that are discovered and identified by the CDW and the CIW. 

2.3.2. SubWordNet Editor Engine

SEE supports SuE to perform its functions. It includes the Concept Manipulator and the Relation 
Manipulator for maintaining noun, verb, adjective, and adverb concepts. In addition, SEE includes the 
Concept Finder and the Relations Finder. The Concept Finder allows a user to locate concepts for 
phrases and concepts in a SubWordNet. Likewise, the Relations Finder allows a user to search for 
relations between two phrases or concepts. These two components automatically maintain the 
SubWordNet integrity by preventing the creation of duplicate concepts and illegal relationships. 
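
A minimal Python sketch of such integrity checks follows. The class name, the relation inventory, and the duplicate test (identical member sets) are illustrative assumptions; the paper does not describe SEE at this level of detail.

```python
class SubWordNet:
    """Toy concept store that rejects duplicate concepts and illegal relations."""

    # Assumed relation inventory; WordNet supports more relation types.
    RELATIONS = {"hypernym", "hyponym", "meronym", "holonym"}

    def __init__(self):
        self.concepts = {}      # concept id -> frozenset of member phrases
        self.relations = set()  # (source id, relation, target id)

    def find_concept(self, phrase):
        """Return ids of all concepts that have the phrase as a member."""
        p = phrase.lower()
        return [cid for cid, members in self.concepts.items() if p in members]

    def add_concept(self, cid, phrases):
        members = frozenset(p.lower() for p in phrases)
        if cid in self.concepts or members in self.concepts.values():
            raise ValueError("duplicate concept: %s" % cid)
        self.concepts[cid] = members

    def add_relation(self, source, relation, target):
        if relation not in self.RELATIONS:
            raise ValueError("unknown relation: %s" % relation)
        if source == target or source not in self.concepts or target not in self.concepts:
            raise ValueError("illegal relation endpoints")
        self.relations.add((source, relation, target))
```

Raising an error rather than silently ignoring the request keeps the decision with the user, which matches the interactive character of the editor.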

2.3.3. SubWordNet Database

The SubWordNet database is the repository for the SubWordNet concepts and their relations. It 
preserves and extends WordNet. 

3. IMPLEMENTATION AND OBSERVATIONS

Our architecture has been partially implemented and is a work in progress. The implemented 
components are shown in Figure 2 by solid boxes. The components yet to be implemented are shown 
with dashed boxes. The SubWordNet maintenance components have been fully implemented and the 
majority of concept discovery components have been implemented. The concept identification 
components have not yet been implemented. The GUI and process layers were implemented in Java 
1.3 and the data layer was implemented using MS Access and ASCII text files. 

In the following subsections, we briefly introduce Navy Lessons Learned documents and present our 
observations in using the implemented components for developing Navy Lessons SubWordNet. We 
present SuE followed by CDW because that is the order in which they were implemented and 
evaluated. 

3.1. Navy Lessons Learned Domain 

The US Navy operates a Navy Lessons Learned System (NLLS, 2001) with the aim of improving its 
operational performance by reusing its problem-solving experience. Over 4,000 lessons a year are 
submitted to NLLS from operations and naval training exercises. An unclassified lesson is shown in 
Figure 3. 

Each lesson contains approximately a page of semi-structured text. Over 12,000 active and inactive 
unclassified lessons are available for the restructuring task. Our aim is to improve their retrieval and 
dissemination process. For our ongoing efforts, we selected 830 documents on the subject of military 
training. 





Lesson ID: LL«nj-05597 Date: 05-May-98
TITLE:

ISMT TRAINER ABOARD SHIP

Naval Task List Tasks: #1
0 491 Develop Training Plans and Programs

OBSERVATION

THE BATTALION LANDING TEAM (BLT) HAS BEEN UNABLE TO USE THE INFANTRY SKILLS
MARKSMANSHIP TRAINER (ISMT) WHILE ABOARD THE USS WASP DUE TO A LACK SPACE TO
PROPERLY SET UP AND SECURE ISMT.

DISCUSSION

THERE HAS BEEN FRICTION WITH THE BLUE AND THE GREEN SIDE ON THE LOCATION AND
STORAGE OF THE ISMT SINCE WE EMBARKED THE SHIP.

LESSON LEARNED

PROPER COORDINATION MUST BE MADE VIA THE NAVY FOR EMBARKING AND STORAGE OF
THE ISMT DURING PHIBRON MEU INTEGRATION TRAINING. IF THIS IS NOT DONE THE ISMT WILL SIT
DORMANT ON THE SHIP AND THE MARINES TRAINING WILL SUFFER.


RECOMMENDATION

DURING THE INITIAL LOAD PLANNING THE BATTALION GUNNER MUST MAKE THE EMBARKATION
OFFICER AWARE OF THE SPACE NEEDED FOR INFANTRY SPECIFIC WEAPON SYSTEMS AND
ALLOW THE EMBARKATION OFFICER TO SUBMIT THE REQUEST TO THE NAVY.

COMMENT 

GNCUSNAVEURADOMSIXTHFLT MAS Mapajemert Site comm art (AIN98). C8F POC - N37, DSN 
626-9000. X83»0 

Figure 3: A Navy Lessons Learned Document 

To begin with, we used the Universal Naval Task List (UNTL, 1996) to manually identify the basic 
concepts for restructuring lessons. Examples of basic concepts include tasks, organizations, places, 
roles, and equipment. Initially, we populated the Navy Lessons SubWordNet manually, and later by
using SuE once it was implemented. Our intent is to extend this SubWordNet using techniques described
here. 

3.2. SubWordNet Maintenance 

SubWordNet maintenance components that have been fully implemented include the SuE, SEE, and 
the SubWordNet database. Figure 4 shows the Alpha version of SuE. 




Figure 4: Editing Navy Lessons SubWordNet using SuE 

SuE provides the ability to search and edit concepts belonging to the four parts of speech supported by 
WordNet. We used search along with drag and drop operations to relate over 200 noun concepts. 

For example, Figure 4 shows us editing the Navy lessons concept “after action review” and relating 
the concept “training evaluation” as its holonym. In addition, we used SuE to view and search the 
WordNet database (Version 1.6) that we have converted into MS Access format for use with our 
architecture. 


















3.3. Concept Discovery Implementation 


The components for concept discovery have been partially implemented. The implemented
components include the CDW and the following CDE components: the Tokenizer, the Lemmatizer,
the POS Tagger, the Acronym Extractor, and the rule-based Phrase Generator.


Figure 5 shows a part of the CDW interface with the acronym extraction results. The Acronym
Extractor uses manually encoded extraction patterns to extract acronyms and their expansions that
co-occur in free text.



Figure 5: CDW with the Acronym extraction results 

Running this extractor on the set of 830 lessons yielded 957 unique expansions over 1411
instances. One hundred and fourteen instances were incorrect extractions giving an overall precision
of 92.6%. For example, one expansion for SIMAS (Sonar In-situ Mode Assessment System) and five 
acceptable expansions for MIO (e.g., Maritime Intercept Operations) were extracted. Various 
abbreviations such as SYSAD (System Administration) and EUCOM (European Command) were 
extracted as well. The majority of errors occurred in expansions for acronyms of size two primarily 
due to capitalized text. 
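
One plausible extraction pattern of this kind can be sketched in Python. The pattern below (an upper-case token in parentheses whose letters match the initials of the words just before it) is an assumption for illustration; the authors' manually encoded patterns are not given in the paper.

```python
import re

# Assumed pattern: a parenthesised upper-case token such as "(ISMT)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,10})\)")
WORD = re.compile(r"[A-Za-z][A-Za-z-]*")


def extract_acronyms(text):
    """Return (acronym, expansion) pairs found as "Expansion Words (ACRONYM)"."""
    pairs = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Take as many preceding words as the acronym has letters.
        words = WORD.findall(text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(acronym) and initials == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs
```

Because the check relies only on initials, it also fires on fully capitalized text, where any run of words can look like an expansion; this is consistent with the error source reported above for two-letter acronyms.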


To measure recall, five sample sets of five lessons each were randomly drawn from the 830 lessons and
processed by the Acronym Extractor. These sample documents were manually examined for 
occurrences of acronym and expansion pairs and those missed by the processor were identified. Recall 
ranged from 85.7% to 100% on the five sets with an average of 91.8%. 

We have implemented a rule-based Phrase Generator that generates multi-word noun phrases by 
iteratively using the following three rules: noun + noun = noun, adjective + noun = noun, and 
adverb + noun = noun. Phrases of additional parts of speech such as verbs and adjectives are yet to be 
implemented. Phrase Generator is supported by Lemmatizer, which uses lemmatizing rules along with 
input from WordNet to derive lemmas, and the POS Tagger, which is our implementation of Brill’s 
part-of-speech tagger in Java (Brill, 1993). 
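
The iterative use of the three rules can be sketched in Python as follows. The coarse tag names ("N", "ADJ", "ADV") and the choice to return only the maximal merged phrases are assumptions of this sketch, not details taken from the implementation described here.

```python
def generate_phrases(tagged):
    """Merge adjacent tokens with noun+noun, adjective+noun, adverb+noun rules.

    tagged: list of (word, tag) pairs with coarse tags "N", "ADJ", "ADV",
    or anything else for tokens that never join a phrase.
    Returns the noun phrases left once no rule applies any more.
    """
    items = list(tagged)
    changed = True
    while changed:
        changed = False
        # Scan right to left so that the head noun absorbs its modifiers.
        for i in range(len(items) - 1, 0, -1):
            left_word, left_tag = items[i - 1]
            right_word, right_tag = items[i]
            if right_tag == "N" and left_tag in ("N", "ADJ", "ADV"):
                # Each rule yields a noun, so the merged item is tagged "N".
                items[i - 1:i + 1] = [(left_word + " " + right_word, "N")]
                changed = True
                break
    return [word for word, tag in items if tag == "N"]
```

A single noun with no modifiers is returned unchanged, which corresponds to the one-word phrases that a generator of this kind also produces.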

We tested the Phrase Generator on the 830 lesson documents. Figure 6 shows the Phrase Generator 
panel with the phrases and their attributes. 






Figure 6: CDW Phrase Generator Panel with results 

Overall, 52,132 phrase instances were generated, of which 23,892 were unique. Examples of
generated phrases include one-word phrases such as “marines”, three-word phrases such as “Maritime
interception operation” and its variant “maritime intercept operation”, and longer phrases such as
“civil emergency preparedness assistance program”.

We examined the overlap of generated phrases with WordNet, both with the complete phrase and its
headword. Only 2,616 of the 23,892 complete phrases were found in WordNet (i.e., coverage of
10.94%). Examples of phrases found in WordNet include “marines”, “standard operating procedure”,
and “cardiopulmonary resuscitation”. The coverage of phrase headwords was 74.4%. Among
headwords not covered by WordNet were “watchstander” and “COMLANTAREA”.
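
Both coverage figures are simple ratios over the unique phrases, as the following Python sketch shows. Taking the last word of a noun phrase as its headword, and the tiny stand-in lexicon, are assumptions made for illustration.

```python
def coverage(phrases, lexicon):
    """Return (complete-phrase coverage, headword coverage) as fractions.

    phrases: iterable of generated phrases (duplicates are ignored).
    lexicon: set of lower-case entries standing in for WordNet.
    """
    unique = {p.lower() for p in phrases}
    full_hits = sum(1 for p in unique if p in lexicon)
    # Assumption: the head of an English noun phrase is its last word.
    head_hits = sum(1 for p in unique if p.split()[-1] in lexicon)
    return full_hits / len(unique), head_hits / len(unique)


# Toy data; the entries are illustrative, not taken from WordNet itself.
TOY_LEXICON = {"marines", "standard operating procedure", "operation", "procedure"}
TOY_PHRASES = ["marines", "standard operating procedure",
               "maritime intercept operation", "watchstander", "Marines"]
```

On the toy data, half of the unique phrases are covered as complete phrases, while three quarters have a covered headword.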

We also examined the overlap between extracted acronyms and generated phrases. This was 
substantial; 294 generated phrases were also acronym expansions and 571 phrases were acronyms. 
For example, the generated phrases include the acronym JTLS and its expansion, Joint Theatre Level
Simulation.

Our observations suggest that the Navy Lessons Learned documents consist largely of
domain-specific concepts not covered by WordNet. Furthermore, only a small fraction of WordNet is
applicable to our domain, thereby justifying the need for developing SubWordNets.

CONCLUSIONS AND FUTURE WORK 

We presented an extensible software architecture for supporting SubWordNet engineering that
integrates the discovery of concept elements from source documents, identification of concepts by
reference to WordNet, and updating SubWordNet with discovered concepts. Our implementation of
concept maintenance components showed that it can be used to edit SubWordNets including
WordNet. The concept discovery components, the Acronym Extractor and the Phrase Generator,
demonstrated encouraging performance. We demonstrated that, even using a small subset of source
documents, the overlap between the Navy Lessons sublanguage and WordNet was small. Our 
architecture has been designed to include relation discovery and concept identification techniques to 
support integrated SubWordNet engineering. We conjecture that our proposed approach is likely to 
produce a SubWordNet well suited to document restructuring and information extraction tasks in 
terms of their coverage and the required semantics. We plan to implement the remaining components 
of the CDE and the CIE and evaluate the architecture in a variety of US Navy NLP applications. 

Acknowledgements 

Thanks to the reviewers for their comments on an earlier version of this paper. We thank the Naval 
Research Laboratory for funding this research. 











References 


Brill, E. (1993) A Corpus-Based Approach to Language Learning, Ph.D. Thesis, University of
Pennsylvania, Philadelphia, PA.

Deerwester, S., Dumais, S. T., & Harshman, R. (1990) Indexing by Latent Semantic Analysis, Journal
of American Society for Information Science, 41(6), pp. 391-407.

Evans R., & Kilgarriff, A. (1995) MRDs, Dictionaries, and How To Do Lexical Engineering,
Proceedings of the 2nd Language Engineering Convention, London, UK, pp. 125-132.

Fellbaum, C. (1998) WordNet, An Electronic Lexical Database, The MIT Press.

Gerhard H., Lauter, M., Quasthoff, U., Witting T., & Wolff, C. (2001) Learning Relations Using
Collocations, Proceedings of IJCAI Workshop Ontology Learning, Seattle, WA, pp. 20-25.

Gupta, K.M., Aha, D.W., Marsh, E., & Maney, T. (2001) Semi-automatically Restructuring Navy
Lessons: A Proposal for Feasibility Assessment and Progress Report, Technical Report: AIC-01-004,
Naval Research Laboratory, Washington, DC.

Hearst, M.A. (1992) Automatic Acquisition of Hyponyms from Large Text Corpora, Proceedings of
the Fourteenth International Conference on Computational Linguistics, Nantes, France, pp. 539-545.

Maedche, A., & Staab, S. (2000) Discovering Conceptual Relations from Text, Technical Report #
399, Institute AIFB, Karlsruhe University.

Maedche, A., & Staab, S. (2001) Semi-Automatic Engineering of Ontologies from Text, Proceedings
of 12th International Conference in Software and Data Engineering, Chicago, USA.

NLLS (2001) Navy Lessons Learned HomePage,
http://www.nwdc.navy.mil/Command/NavyLessonsLearned/nllsoverview.asp

Rada M., & Moldovan, D. I. (2001) Automatic Generation of a Coarse grained WordNet, Proceedings
of NAACL 2001 Workshop on WordNet and Other Lexical Resources: Application, Extensions and 
Customizations, Pittsburgh, PA, pp. 35-40. 

UNTL. (1996) The Universal Naval Task List, OPNAVINST 3500.38/MCO 3500.36 USCG
COMDTINST M3500.1,
http://nwdc-db-n.nwdc.navy.mil/pack/webserver_code.mainmenu?root_id=16&levels=1.

Vossen, P. (2001) Extending, Trimming and Fusing WordNet for Technical Documents, Proceedings
of NAACL 2001 Workshop on WordNet and Other Lexical Resources: Application, Extensions and
Customizations, Pittsburgh, PA, pp. 125-131.

Woods, W.A. (1997) Conceptual Indexing: A Better Way to Organize Knowledge, Sun Technical
Report: TR-97-61, Palo Alto, CA 94303, USA.

Yamaguchi, T. (2001) Acquiring Conceptual Relationships from Domain-Specific Texts, IJCAI
Workshop on Ontology Learning, Seattle, Washington, pp. 13-18.

Zernik, U. (1991) Introduction. In Exploiting On-Line Resources to Build a Lexicon, (Ed. U. Zernik),
Lawrence Erlbaum, Hillsdale, New Jersey, pp. 1-26.

Kalyan Moy Gupta, Advanced Engineering Sciences, ITT Industries, 2560 Huntington Avenue, Alexandria, VA
22306, USA and Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory
(Code 5510), 4555 Overlook Avenue SW, Washington, DC, 20375, USA, gupta@aic.nrl.navy.mil

David W. Aha, Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory
(Code 5510), 4555 Overlook Avenue SW, Washington, DC, 20375, USA, aha@aic.nrl.navy.mil

Tucker Maney, Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory
(Code 5510), 4555 Overlook Avenue SW, Washington, DC, 20375, USA, maney@aic.nrl.navy.mil





Extending Synsets with Medical Terms 


Paul Buitelaar 
Bogdan Sacaleanu 


Abstract 

An important problem with general semantic lexicons like WordNet or GermaNet is that they do
not cover many terms and concepts specific to certain domains. Therefore, these resources need to be 
tuned to a specific domain at hand. This involves selecting those senses that are most appropriate for the 
domain, as well as extending the sense inventory with novel terms and novel senses that are specific to 
the domain. In this paper we focus on extending GermaNet synsets with domain specific terms, taking 
into account the domain relevance of senses (i.e. synsets). 

1 Introduction 

Natural language applications, such as information extraction and machine translation, require 
a certain level of semantic analysis. An important part of this process is semantic tagging: the
annotation of each content word with a semantic category. Semantic categories are assigned 
on the basis of a semantic lexicon like WordNet for English (Miller et al., 1995) or similar 
resources like GermaNet for German (Hamp and Feldweg, 1997). A problematic issue,
however, is that general semantic lexicons like WordNet or GermaNet do not cover many 
terms and concepts specific to certain domains. Therefore, these resources need to be tuned to 
a specific domain at hand. This involves selecting those senses that are most appropriate for 
the domain, as well as extending the sense inventory with novel terms and novel senses that 
are specific to the domain. Some work in this area has been reported, with an emphasis on 
domain specific sense selection (Basili et al., 1997; Cucchiarelli and Velardi, 1998; Turcato et 
al., 2000). In (Buitelaar and Sacaleanu, 2001) a bottom up approach to sense selection was 
reported, which determines the domain specific relevance of (WordNet, GermaNet) synsets 
on the basis of the relevance of their constituent synonyms that co-occur within representative 
domain corpora. In this paper we focus on extending GermaNet synsets with domain specific 
terms, taking into account the domain relevance of concepts (i.e. senses) as computed by the 
method described in (Buitelaar and Sacaleanu, 2001). 

We approach the extension task from two angles: through morphological analysis 
(decomposition) and through learning semantic similarity from co-occurrence patterns on 
domain specific corpora. The system includes a linguistic preprocessing step in which all 
words are annotated with part-of-speech and morphological information. We used the TnT 
tagger (Brants, 2000) for part-of-speech tagging and the MMORPH package (Petitpierre and 
Russell, 1995) for morphological analysis. The medical domain corpus used for the 
experiments reported here has been collected in the context of the MUCHMORE project on 
cross-lingual retrieval of medical information (Buitelaar, 2000). The corpus consists of 
abstracts of scientific articles in various areas of medical research as obtained from the 
Springer LINK web site (http://link.springer.de/).

2 Extension by Decomposition 

As German is a highly compositional language, morphological decomposition is the most 
intuitive way of acquiring novel terms from German domain specific corpora. Every 
compound is a specification of its head (i.e. stem). Therefore, compounds can be easily added 
to GermaNet as hyponyms of this head word. For instance, some compounds with head 
Therapie (therapy) in the medical corpus are:



Antibiotikatherapie (antibiotics therapy), Gentherapie (gene therapy), Lasertherapie (laser
therapy), Sauerstofftherapie (oxygen therapy), Toxoplasmosetherapie (toxoplasmosis therapy)
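
As a rough illustration, compounds sharing a head can be collected by suffix matching, as in the Python sketch below. This is a simplification introduced here: the system described in this paper relies on morphological analysis, and plain suffix matching would also accept words that merely end in the same letters.

```python
def compounds_of(head, vocabulary):
    """Return words in the vocabulary that end in the head but are longer."""
    h = head.lower()
    return sorted(
        word for word in vocabulary
        if word.lower().endswith(h) and len(word) > len(h)
    )


# Toy vocabulary drawn from the examples in the text.
TOY_VOCABULARY = {"Gentherapie", "Lasertherapie", "Therapie", "Tumor", "Blasentumor"}
```

For the head Therapie, the toy vocabulary yields Gentherapie and Lasertherapie, while the bare head itself is excluded.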

In order to limit this process to domain relevant terms and synsets, each term is assigned a 
term relevance relative to its occurrence in other domain corpora as described in (Buitelaar 
and Sacaleanu, 2001). The relevance measure is a slightly adapted version of standard tf.idf 
as used in vector-space models for information retrieval (Salton and Buckley, 1988): 

\[ rlv(t \mid d) = \log(tf_{t,d}) \cdot \log\left(\frac{N}{df_t}\right) \]

where t represents the term, d the domain, tf_{t,d} the frequency of t in d, df_t the number of
domains in which t occurs, and N the total number of domains. This formula
gives full weight to words that occur in just one domain and a weight of zero to those
occurring in all domains. The term relevance of each term is used to compute a relevance
measure also for each synset (i.e. sense) in which these terms occur as a synonym. Synsets
are ranked by this relevance measure and the topmost synsets are selected as domain
relevant. The extension process described in this paper is restricted to these topmost domain
relevant synsets and topmost novel terms.
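
The behaviour of the term relevance measure can be checked with a few lines of Python. The sketch assumes the standard tf.idf reading of the symbols (term frequency in the domain, number of domains containing the term); the function and parameter names are illustrative.

```python
import math


def relevance(tf_td, df_t, n_domains):
    """Term relevance: log(tf_{t,d}) * log(N / df_t).

    tf_td: frequency of the term in the domain corpus.
    df_t: number of domain corpora in which the term occurs (1..N).
    n_domains: total number of domain corpora, N.
    """
    if tf_td <= 0:
        return 0.0  # the term does not occur in this domain
    if not 1 <= df_t <= n_domains:
        raise ValueError("df_t must lie between 1 and n_domains")
    return math.log(tf_td) * math.log(n_domains / df_t)
```

With four domains, a term found in all of them scores zero whatever its frequency, and the score grows as the term is confined to fewer domains.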

2.1 Heads with One Sense 

Adding compounds to GermaNet is straightforward, if the headword in question has only one 
sense. For instance, Tumor (tumor) has only the following sense: 

#1 [Geschwulst, Geschwür, Tumor] (blastoma, ulcer, tumor)

Hence, the following compounds can be added to GermaNet through the hyponymy relation: 

Blasentumor (bladder tumor), Magentumor (stomach tumor), Schädelbasistumor (cranial base
tumor), Talgdrüsentumor (sebaceous glands tumor), Wirbelsäulentumor (spinal tumor)

2.2 Heads with More Senses 

The acquisition process can be easily automated for those headwords that have only one 
sense. More frequently, however, headwords have at least two senses. This introduces an 
ambiguity in adding compounds as hyponyms to one of the senses. We distinguish two cases: 
1. only one sense is relevant to the domain; 2. two or more senses are relevant to the domain. 
The first case applies if only one sense of a given headword was determined to be domain 
relevant by the automatic method described above. It is then to be expected that all 
compounds of the headword also refer to this sense (see footnote 1). Take for example Gewebe (tissue):

#1 [Gewebe, Körpergewebe] (tissue, body tissue)

#2 [Gewebe, Stoff, Textilstoff] (tissue, cloth, textile)

Since only the first sense applies to the medical domain, all compounds that were 
automatically extracted from the medical corpus can be added as hyponyms of sense #1, e.g.: 

Entzündungsgewebe (infection tissue), Karzinomgewebe (carcinoma tissue), Pankreasgewebe
(pancreas tissue), Schilddrüsengewebe (thyroid gland tissue)


1 Unfortunately, even if the head word has a clearly dominant sense within the domain, some instances of other
senses may occur as well, e.g. Polstergewebe (cushion/pad tissue) in our medical corpus.






Compounds of a head word may be added as hyponyms of either sense if two or more senses 
were determined to be equally relevant within the domain. Consider for instance the noun 
Infektion (infection) with the following two senses: 

#1 [Entzündung, Infektion, Infektionskrankheit] (infection, inflammation, infectious disease)

#2 [Ansteckung, Infektion, Übertragung] (infection, transmission)

Some of the compounds extracted from the medical corpus with this noun as head are given 
below. The list contains hyponyms of both senses, with underlined terms corresponding to the 
second sense. 

Blutstrominfektion (blood flow infection), Erstinfektion (initial infection), Hautinfektion (skin
infection), Krankenhausinfektion (hospital infection), Luftweginfektion (airborne infection)

To add the right compounds as hyponyms to the right sense, some additional processing is 
needed. Clustering techniques could be used to automatically separate the compounds into
several groups, each corresponding to a sense of the headword. In the current
system, however, clustering has not yet been implemented. For now, a supervised process is 
assumed in which a domain expert decides which compounds are added as hyponyms of 
which sense. 
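
The two cases distinguished in this section can be summarised as a small decision procedure. The following Python sketch is illustrative only: the function name attach_compounds, the sense labels such as "Gewebe#1" and the return layout are assumptions introduced here and are not part of the system described.

```python
def attach_compounds(head, compounds, relevant_senses):
    """Split the compounds of `head` into automatic and expert-reviewed cases.

    relevant_senses: identifiers of the senses of `head` judged domain relevant.
    Returns (auto, pending): `auto` maps each compound to the single sense it is
    attached to as a hyponym; `pending` lists the compounds that a domain expert
    still has to assign because no unique relevant sense exists.
    """
    auto, pending = {}, []
    if len(relevant_senses) == 1:
        # Exactly one domain-relevant sense: attach every compound to it.
        sense = relevant_senses[0]
        for compound in compounds:
            auto[compound] = sense
    else:
        # Zero or several relevant senses: no automatic decision is made.
        pending = list(compounds)
    return auto, pending


# Hypothetical sense labels, for illustration only.
auto, pending = attach_compounds(
    "Gewebe", ["Karzinomgewebe", "Pankreasgewebe"], ["Gewebe#1"])
auto2, pending2 = attach_compounds(
    "Infektion", ["Hautinfektion", "Erstinfektion"],
    ["Infektion#1", "Infektion#2"])
```

As footnote 1 points out, even the single-sense case may attach an occasional compound to the wrong sense, so the automatic branch should be read as a heuristic.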

3 Extension by Similarity 

Much of the work on the acquisition of semantic classes has been based on statistics over the
co-occurrence of words within a fixed window of text, where a window can be a number of words,
a sentence, a paragraph, or even an entire document (e.g. Church and Hanks, 1990; Brown et
al., 1992). The results of these approaches have shown that a simple frequency analysis of
words co-occurring with other words may indicate classes of similar meanings. Here we
present an approach to semantic classification that uses patterns of lexico-syntactic context to
discover semantic similarities between classes in GermaNet (i.e. synsets) and novel terms that 
are not currently in GermaNet. The hypothesis on which this work is based is that words used 
in similar syntactic contexts and with a large overlap in lexical information will be 
semantically similar. In other words, we intend to classify words by means of their lexical 
contexts under consideration of syntactic constraints. 

3.1 System Overview 

The system assumes a set of novel terms and a set of domain specific synsets to which these 
terms will be assigned (classified). Both novel terms and domain specific synsets are selected 
using the methods discussed in (Buitelaar and Sacaleanu, 2001). For each of the novel terms 
and for each of the synonym terms of the synsets, lexico-syntactic patterns are extracted from 
the corpus and a co-occurrence measure is computed on each of their instances. Finally, an 
instance-based learning algorithm is used to generate a classifier for each of the patterns, 
which is used to automatically assign a novel term to one of the synsets. 

3.2 Lexico-Syntactic Patterns 

For each novel term and for each synonym term in a domain relevant synset a set of
lexico-syntactic patterns within a window of n words is extracted (in all experiments reported here n
= 7). Here, we only consider nouns, adjectives and verbs as relevant, with all other word 
classes being marked as irrelevant (“null”). For instance, the following pattern represents the 





context of a term (T) with two irrelevant words and an adjective on its left side, and an
irrelevant word, followed by an adjective and a noun on the right: [null, null, ADJ, T, null, 
ADJ, NN]. Then, for each pattern all corpus instances are extracted. In this way, each novel 
term and each synset is represented by a number of lexico-syntactic context instances that are 
used in classification. For each context instance, we compute a mutual information score for 
all co-occurrence pairs. A co-occurrence pair, written as (x, y), represents the co-occurrence 
of a term, x, with a context word, y, within a context instance. Let N be the total number of 
words that occur in all instances of a given pattern. Using the Maximum Likelihood Estimator 
(MLE), the probability of a pair, P(x, y), is estimated by its relative frequency:

\[ P(x, y) \approx \frac{f(x, y)}{N} \]


where f(x, y) is the count of the pair (x, y) over all instances of the pattern. Similarly, the
probability of an occurrence of a word, P(w), is estimated by:

\[ P(w) \approx \frac{f(w)}{N} \]


The mutual information of a co-occurrence pair MI(x, y) is then estimated by: 


\[ MI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \approx \log_2 \frac{N\, f(x, y)}{f(x)\, f(y)} \]



where x is a term and y a context word. The frequency of synsets is defined by the sum of the 
frequencies of its component synonym terms. The co-occurrence frequency of a synset with a 
context word is then defined by the co-occurrence of the context word with the synonym 
terms. Thus, for a given synset C: [t_1, t_2, t_3, ..., t_n] and a context word w, the mutual
information will be defined as follows:

\[ MI(C, w) \approx \log_2 \frac{N \sum_{i=1}^{n} f(t_i, w)}{f(w) \sum_{i=1}^{n} f(t_i)} \]


where t_i are the synonym terms of synset C. To arrive at a mutual information score between
synsets and their contexts as a whole, we decided to simply take the sum of the mutual 
information for all context words. 
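
The estimates in this section can be traced on a toy example. The sketch below is a minimal illustration, assuming that N counts every word token (terms and context words) in the instances of one pattern and that logarithms are taken to base 2; the function names and the sample data are invented here and do not come from the described system.

```python
import math
from collections import Counter


def count(instances):
    """instances: list of (term, [context words]) for a single pattern."""
    pair_freq, word_freq = Counter(), Counter()
    for term, context in instances:
        word_freq[term] += 1
        for word in context:
            word_freq[word] += 1
            pair_freq[(term, word)] += 1
    n_words = sum(word_freq.values())  # assumed reading of N
    return pair_freq, word_freq, n_words


def mi_pair(x, y, pair_freq, word_freq, n_words):
    """MI(x, y) ~ log2(N f(x, y) / (f(x) f(y))); -inf if the pair is unseen."""
    if pair_freq[(x, y)] == 0:
        return float("-inf")
    return math.log2(n_words * pair_freq[(x, y)] / (word_freq[x] * word_freq[y]))


def mi_synset(synset, w, pair_freq, word_freq, n_words):
    """MI(C, w): frequencies of the synonym terms of C are summed first."""
    joint = sum(pair_freq[(t, w)] for t in synset)
    freq = sum(word_freq[t] for t in synset)
    if joint == 0:
        return float("-inf")
    return math.log2(n_words * joint / (freq * word_freq[w]))


def instance_score(term, context, pair_freq, word_freq, n_words):
    """Score of one context instance: sum of MI over its context words."""
    return sum(mi_pair(term, w, pair_freq, word_freq, n_words) for w in context)


# Invented toy data for one pattern.
instances = [("Tumor", ["malign", "Patient"]),
             ("Geschwulst", ["malign"]),
             ("Tumor", ["Patient"])]
pair_freq, word_freq, n_words = count(instances)
mi_tumor_patient = mi_pair("Tumor", "Patient", pair_freq, word_freq, n_words)
mi_c1_malign = mi_synset(["Tumor", "Geschwulst"], "malign",
                         pair_freq, word_freq, n_words)
```

Returning negative infinity for unseen pairs is one possible convention; the text does not say how zero counts are handled.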


3.3 Instance-Based Learning

Deciding on similarity between terms and term classes (i.e. synsets) according to a shared 
context is a task well suited for machine learning. More specifically, we decided to use an 
instance-based (k-nearest neighbor) classification algorithm that uses all of the context
instances to assign the most similar class (synset) to a novel term (Witten and Frank, 2000).

For each of the syntactic patterns the learning system creates a data model that consists 
of all context instances for each synset. For instance, for a lexico-syntactic pattern of the
form [NN, null, ADJ, (T/C), null, NN, null], context instances for a synset C are represented
in the format² C, MI, Noun_i, Adjective_j, Noun_k. Similarly, a data model of context instances is
created for each novel term T in the following format: T, MI, Noun_l, Adjective_m, Noun_n.

In classification, only assignments that were made uniformly by different k-values are
considered. Results take the form of a list with one assignment for each pattern. In order to 
obtain the most likely one among these, we introduce a simple bagging strategy, which selects 
the most frequent assignment. 
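
The combination of k-nearest-neighbor voting, agreement across several values of k, and bagging over patterns can be sketched as follows. This is a simplified illustration under stated assumptions: the distance function (difference in MI plus the number of differing context words) is invented here, each novel term is represented by a single query instance per pattern, and all names and data are hypothetical.

```python
from collections import Counter


def distance(a, b):
    """Assumed mixed distance between two (MI, words) context instances."""
    (mi_a, words_a), (mi_b, words_b) = a, b
    return abs(mi_a - mi_b) + sum(x != y for x, y in zip(words_a, words_b))


def knn_label(query, training, k):
    """Majority synset among the k training instances nearest to `query`."""
    nearest = sorted(training, key=lambda inst: distance(query, inst[1]))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]


def stable_label(query, training, ks):
    """Keep an assignment only if every value of k yields the same synset."""
    labels = {knn_label(query, training, k) for k in ks}
    return labels.pop() if len(labels) == 1 else None


def bag(assignments):
    """Select the most frequent assignment over all patterns (None = no vote)."""
    votes = Counter(a for a in assignments if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Invented training instances for one pattern: (synset, (MI, context words)).
training = [
    ("C1", (2.0, ("maligne", "Patient"))),
    ("C1", (2.2, ("maligne", "Befund"))),
    ("C1", (1.9, ("gross", "Patient"))),
    ("C17", (0.5, ("operativ", "Eingriff"))),
    ("C17", (0.4, ("operativ", "Patient"))),
]
stable = stable_label((2.1, ("maligne", "Patient")), training, (1, 3, 5))
unstable = stable_label((0.45, ("operativ", "Patient")), training, (1, 3, 5))
final = bag(["C1", None, "C17", "C1"])
```

In the second query the assignment changes between k = 3 and k = 5, so it is discarded before bagging.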

3.4 Experiments 

In order to test the classification system, we ran an experiment on a corpus of medical 
abstracts (Buitelaar, 2000). Using the methods discussed in (Buitelaar and Sacaleanu, 2001), 
we automatically extracted a set of domain relevant synsets and domain relevant novel terms. 

For evaluation purposes, we asked a medical domain expert to manually classify the 
top 150 novel terms, given a selection of the most domain relevant synsets. From the top 25 
synsets, as proposed by the system, the medical domain expert discarded 4. Further, only 56 
novel terms could be manually classified given these 21 synsets. The evaluation set therefore 
consists of 56 novel terms classified in the following synsets (in order to increase coverage of 
each synset on the medical corpus, we included synonyms (in bold) and direct hyponyms):

C1: [Geschwulst, Geschwür, Tumor, Abszeß, ...] (blastoma, ulcer, tumor, abscess)

C2: [Krankheit, Abhängigkeit, Anfall, Attacke, ...] (disease, addiction, seizure, attack)

C3: [Gewebe, Körpergewebe, Bindegewebe, ...] (tissue, body tissue, connective tissue)

C4: [Entzündung, Infekt, Infektion, ...] (infection, inflammation)

C5: [Krankheitsbild, Syndrom, ...] (clinical syndrome)

C6: [Symptom] (symptom)

C7: [Gelenk, Ellbogen, Fingergelenk, ...] (joint, elbow, finger joint)

C8: [Reduktion, Reduzierung, Abbau, ...] (reduction, decrease, atrophy)

C9: [Anordnung, Aufstellung, Formation, ...] (order, disposition, formation)

C10: [Medizin, Chirurgie, Frauenheilkunde, ...] (medicine, surgery, gynecology)

C11: [Quote, Rate, Beschleunigungsquote, ...] (proportion, rate)

C12: [Parameter] (parameter)

C13: [Blutung, Blutverlust] (bleeding, loss of blood)

C14: [Facharzt, Augenarzt, Chirurg, ...] (specialist, ophthalmologist, surgeon)

C15: [Leiden, Allergie, Anämie, Arthrose, ...] (ailment, allergy, anemia, arthrosis)

C16: [Zelle, Körperzelle, Pflanzenzelle] (cell, body cell, plant cell)

C17: [Eingriff, Operation, Abtreibung, ...] (operation, abortion)

C18: [Abhandlung, Studie] (survey, case study)

C19: [Prophylaxe, Empfängnisverhütung, ...] (prophylaxis, contraception)

C20: [Drüse, Bauchspeicheldrüse, ...] (gland, pancreas)

C21: [Krankheitssymptom, Symptom] (disease symptom, symptom)

For each of the patterns, a classifier assigns a synset to each of the 150 novel terms. We used 
five different values of k (3, 6, 9, 12, 15) to validate results. Only assignments that were 
invariant for all values of k are kept. Through our bagging strategy we then select from among 
all the assignments the most frequent one. Some examples of correctly classified novel terms
are given below. A full account of results is presented in Table 1. 

² As null attributes play no further role in the classification process, they are discarded in the representation of
the context instances. 





C1: Karzinom (carcinoma), Metastase (metastasis), Neoplasie (neoplasia)

C11: Prävalenz (prevalence), Spezifität (specificity)

C17: Resektion (resection), Transplantation (transplantation)

To test our approach in using lexico-syntactic patterns, we also ran an experiment that takes 
into account the lexical context but with more flexible syntactic constraints. For this purpose 
we extracted contexts in windows of 3 words on each side of the novel term, as in the original 
approach, but instead of taking into account the position of these words we now only consider 
their order of occurrence. 

Consider the same example as before: [NN, null, ADJ, (T/C), null, NN, null]. Ignoring
syntactic constraints, we now only consider the occurrence of NN and ADJ on the left and NN 
on the right. Therefore, this pattern is then equivalent to all of the following patterns and 
many more: [NN, ADJ, (T/C), NN], [null, NN, ADJ, (T/C), NN], [NN, null, ADJ, (T/C), null, 
NN]. As shown by the results below, our original approach outperforms this alternative 
approach, which indicates that, in addition to the lexical context, a representation of the syntactic
constraints on this context is an important source of information.
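
The difference between the positional patterns of the original approach and the order-only variant can be made concrete with a short sketch. It is an illustration under assumptions: the tag names ART and PREP stand in for arbitrary irrelevant word classes, the target slot is written "T", and the function names are invented here.

```python
RELEVANT = {"NN", "ADJ", "VERB"}  # word classes considered relevant
TARGET = "T"                      # slot of the term or synset member


def strict_pattern(tags):
    """Positional pattern: every irrelevant word class becomes 'null'."""
    return [t if t in RELEVANT or t == TARGET else "null" for t in tags]


def relaxed_pattern(tags):
    """Order-only pattern: 'null' slots are dropped, the order is kept."""
    return [t for t in strict_pattern(tags) if t != "null"]


# Two windows that differ in position but not in order of relevant words.
window_a = ["NN", "ART", "ADJ", "T", "PREP", "NN", "ART"]
window_b = ["ART", "NN", "ADJ", "T", "NN"]
strict_a = strict_pattern(window_a)
same_when_relaxed = relaxed_pattern(window_a) == relaxed_pattern(window_b)
```

Under the strict representation the two windows fall into different patterns, whereas the relaxed representation merges them, which is the loss of syntactic information the experiment measures.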

Finally, we also evaluated our strategy to simply sum the mutual information scores 
for all (relevant: ADJ, NN, VERB) context words. Instead, we ran an experiment with our 
original approach, keeping mutual information scores separate for each context word. Given 
the same example again, a context instance then has the following format: C, MI_1, MI_2, MI_3,
Noun, Adjective, Noun.



[Table 1 lists, for each of the three approaches, the number of manually classified terms (56 in each case), the number of system assignments and the number of correct assignments; the system and correct counts are not legible in the source.]

Table 1: Overall Results for Each Approach

3.5 Discussion

Our original approach gives the best results (about 41% precision), which compares with about 5%
for a completely random classification (one chance in 21 synsets). A comparison with other systems is
almost impossible. First of all, to our knowledge, no other work exists on the automatic
classification of terms to WordNet/GermaNet synsets. Work on term clustering is related to
our work but not directly comparable. Also, comparing classification results between domains
is not straightforward. We consider a precision of about 41% relatively high given the
completely unsupervised nature of our approach. Classification is performed without any
prior sense disambiguation.

The main problem we encountered in this work was the fact that a lot of the synsets in 
GermaNet were deemed problematic from the medical point of view. For instance, the synset 
[Abhandlung, Studie] connects Studie (case study) with Abhandlung (survey). Additional 
problems arose in connection with PoS tagging and morphological analysis, specifically 
concerning compounds, both of which need to be further adapted to the medical domain. 
Finally, from the machine learning point of view, we decided that the attributes we use (a 
context word and its mutual information score) are dependent on each other, which needs to 
be reflected in the data model. The instance-based algorithm we currently use does not 
provide such an option. 











4 Acknowledgements 

We would like to thank Jorg Bay and Oktavian Weiser for providing us with medical 
expertise. This research has in part been supported by EC/NSF grant IST-1999-11438 for the
MUCHMORE project. 

5 References 

Basili R., Della Rocca M. and Pazienza M.-T. Contextual Word Sense Tuning and 
Disambiguation. Applied Artificial Intelligence, vol. 11, 1997. 

Brown P., Pietra, V., deSouza P. V., Lai J., and Mercer R. L. Class-based n-gram models of
natural language. Computational Linguistics, 18:467-479, 1992.

Buitelaar, P. MUCHMORE: Multilingual Concept Hierarchies for Medical Information
Organization and Retrieval. In: Proceedings of ASIS, Chicago, 2000.

Buitelaar P. and Sacaleanu B. Ranking and Selecting Synsets by Domain Relevance. In: 
Proceedings NAACL WordNet Workshop, 2001. 

Church, K. and Hanks, P. Word Association Norms, Mutual Information, and Lexicography.
Computational Linguistics, vol. 16:1, 22-29, 1990.

Cucchiarelli A. and Velardi P. Finding a Domain Appropriate Sense Inventory for 
Semantically Tagging a Corpus. In: Journal of Natural Language Engineering, 1998. 

Brants, T. TnT - A Statistical Part-of-Speech Tagger. In: Proceedings of 6th ANLP
Conference, Seattle, WA, 2000.

Hamp, B. and Feldweg, H. GermaNet: a Lexical-Semantic Net for German. In: Proceedings 
of the ACL/EACL97 workshop on Automatic Information Extraction and Building of Lexical 
Semantic Resources for NLP Applications, Madrid, 1997. 

Miller, G.A. WordNet: A Lexical Database for English. Communications of the ACM, 1995.

Petitpierre, D. and Russell, G. MMORPH - The Multext Morphology Program. Multext
deliverable report for the task 2.3.1, ISSCO, University of Geneva, 1995.

Salton, G. and Buckley, C. Term-Weighting Approaches In Automatic Text Retrieval. In: 
Information Processing & Management. 24, 5, pp.515-523, 1988. 

Turcato D., Popowich F., Toole J., Fass D., Nicholson D. and Tisher G. Adapting a synonym 
database to specific domains. In: Proceedings of the ACL workshop on recent advances in 
NLP and IR. Hong Kong, 2000. 

Witten, Ian H., Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques 
with Java Implementations. Morgan Kaufmann Series in Data Management Systems, 2000. 

Paul Buitelaar, Bogdan Sacaleanu 
DFKI GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbruecken, Germany

{paulb, bogdan}@dfki.de





Characterizing the Definitions of Anatomical Concepts 
in WordNet and Specialized Sources 


Olivier Bodenreider 
Anita Burgun 


Abstract 

The objective of this study is to characterize the definitions of anatomical concepts in a general 
terminological system (WordNet) and a domain-specific one (a medical dictionary). Methods: 
Definitions were first classified into five groups with respect to the nature of the definition. The principal 
noun phrase (or head) of the definiens was then compared to the definiendum through a reference 
hierarchy of anatomical concepts. Results: This study confirms the predominance of genus-differentia 
definitions for anatomical terms. Hierarchical relationships are, as expected, the principal type of 
relationships found between the definiendum and the head of the definiens. Discussion: Differences in 
the characteristics of the definitions between WordNet and medical dictionaries are presented and 
discussed. 

Introduction 

We are interested in characterizing the definitions of medical terms in various sources in order to get a 
better understanding of their structure. Our ultimate goal, though, is to obtain a representation of the 
definitions in a formalism such as conceptual graphs in order to compare definitions from various 
sources. This study is part of a larger project aimed at comparing definitions of medical terms in 
specialized sources such as medical dictionaries with those in general resources such as WordNet®. In 
other words, our goal is to compare definitions of medical terms for health professionals and for users of 
consumer health applications. 

Although not completely unrelated to them, the task of characterizing definitions is quite different from 
other tasks in which definitions are involved, especially acquiring definitions from a corpus (see Klavans 
and Muresan (2000) for an application to the medical domain) or acquiring an ontology from definitions 
as proposed by Shaikevich (1985). 

1. Background 

1.1. Kinds of Lexical Definitions 

As in most dictionaries, the definitions in both medical dictionaries and WordNet are made of two parts: 
the term to be defined (or definiendum) followed by or linked to the expression used to define it (or 
definiens). Besides relying on synonymy or antonymy, i.e. linking a term to its synonym or opposite, 
several methods can be used to create dictionary definitions, also called lexical definitions. In a Genus- 
Differentia definition, the definiendum is described first by a broader category, the genus, then 
distinguished from other items in that category by differentia. Although a similar method has been long 
used to classify living organisms, its application extends beyond the domain of biology. Other kinds of 
definitions include those describing the cause or the function of the definiendum [Swartz (1997)]. 

1.2. Definitions in WordNet 

WordNet, an electronic lexical database, has been developed and maintained at Princeton University 
since 1985 [Fellbaum (1998)]. Sets of synonymous terms, or synsets, constitute its basic organization. 



The current version (1.7) integrates about 100,000 synsets. Several types of relations between synsets are 
recorded in WordNet, including hyponymy and meronymy. In addition, each synset has a definition (or 
gloss) that defines the synset. 

WordNet has been compared to specialized knowledge sources, including sources in the biomedical
domain; see, for example, Burgun and Bodenreider (2001). Comparison often relies on the semantic
structure of WordNet, i.e., the relationships represented in WordNet (synonymy among terms and 
hierarchical relationships among synsets). However, as noted by Harabagiu and Moldovan (1998), the 
definitions also represent an interesting source of knowledge. They propose to transform the definitions 
into directed acyclic graphs whose nodes are WordNet synsets and whose links are lexical relations. 
While the somewhat stereotyped structure of most WordNet glosses is expected to facilitate the analysis, 
they acknowledge that ambiguity, both lexical and semantic, is likely to represent a major difficulty. 

1.3. Definitions of Anatomical Terms

We selected anatomy as the domain of our study because it is central to the larger biomedical domain 
and to some extent part of the general domain. Not surprisingly, anatomy is well-represented in
WordNet, where the synset “body part” has 1785 hyponyms, direct and indirect. Specialized resources 
such as the University of Washington Digital Anatomist symbolic knowledge base (UWDA) created by 
Rosse et al. (1998) constitute authoritative resources useful for establishing a list of anatomical terms. In 
addition, UWDA will also be useful for building the lattice of anatomical concepts needed for analyzing 
conceptual graphs later in this project. UWDA, however, could not be used as a source of definitions in 
this study because it provides definitions only for some high-level concepts.

Anatomy inherently results from observation, sometimes long before the entity observed can be named 
and classified. As a consequence, in anatomical definitions generally, what is attached to a lexical entry 
is still sometimes a description, useful for locating the physical entity while observing or dissecting, 
rather than a definition, for locating the concept in semantic space. For example, a description of 
“Adrenal gland” may refer to its shape, color and location relative to the kidney. Depending on the 
source, descriptions are either free-text or structured. For example, a template for the description of 
nerves includes information about their origin, their distribution and their branches. 

A major kind of lexical definition, which includes definitions of anatomical concepts, is the traditional 
genus-differentia definition, in which the genus is often a broad category such as “organ” or “muscle” 
and the differentia can be, among others, a location (e.g., “situated near the kidney”) or a function (e.g., 
“that carries blood from the heart to the body”). In addition to genus-differentia definitions, in which the 
definiendum is by definition in a taxonomic relationship with the definiens, various kinds of definitions 
may be found for anatomical terms. These include definitions by meronymy, in which the definiendum is 
in a ‘part of’ relationship with the definiens, and definitions emphasizing a property or a function,
expressed by a general term, instead of a genus. Examples of the various kinds of descriptions and 
definitions found for anatomical terms are given in Table 1. 



Category     Subcategory        Example
Definition   Genus-Differentia  Tarsal bone: the seven bones of the ankle
             Meronymy           Small intestine: the proximal portion of the intestine
             Property           Diaphragm: a muscular partition separating the abdominal and thoracic cavities
Description  Free-text          Adrenal gland: a flattened body situated in the retroperitoneal tissues at the
                                cranial pole of each kidney
             Structured         Soleus muscle:
                                • origin, fibula, popliteal fascia, tibia;
                                • insertion, calcaneus by tendo calcaneus;
                                • action, plantar flexes ankle joint


Table 1: Categories of descriptions and definitions found for anatomical terms. 




2. Material 


2.1. Source of Anatomical Terms 

Starting with approximately 4000 concepts in UWDA, we used the term listed in UWDA as the 
“preferred term” for each concept. This is the term used in most anatomy textbooks, as opposed to, say, 
clinical variants. We then filtered out terms corresponding to highly specialized concepts, not likely to be 
found in a general resource. We used filters based on the presence in the term of adjectival or 
prepositional modifiers indicative of the specialization of the term (e.g., left / right, anterior / posterior, 
mention of a particular vertebra, finger or toe). For example, “median nerve” belongs to our list while 
“right median nerve” was filtered out. Names for specific joints (e.g., “Calcaneocuboid joint”) and 
ligaments (e.g., “Patellar ligament”) were also filtered out, leaving mostly muscles (e.g., “Biceps 
brachii”) and nerves (e.g., “Sensory nerve”) in addition to organs such as heart and lung and organ 
categories such as gland and muscle. Applying these filters, we selected 420 terms (about 10%) suitable 
for further analysis. 
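
A filter of this kind might be sketched as follows. The sketch is a rough illustration, not the filter actually used: it covers only the left / right and anterior / posterior modifiers and the joint and ligament exclusions mentioned above, it omits the vertebra, finger and toe cases, and the pattern names are invented here.

```python
import re

# Modifiers taken to signal a highly specialized term (partial list).
SPECIALIZING = re.compile(r"\b(left|right|anterior|posterior)\b", re.IGNORECASE)
# Names of specific joints and ligaments: a qualifier followed by the kind.
SPECIFIC_KIND = re.compile(r"\S+\s+(joint|ligament)$", re.IGNORECASE)


def keep_term(term):
    """Return True if the term passes both filters."""
    return not (SPECIALIZING.search(term) or SPECIFIC_KIND.search(term))


candidates = ["median nerve", "right median nerve", "Calcaneocuboid joint",
              "Patellar ligament", "Biceps brachii", "heart"]
selected = [t for t in candidates if keep_term(t)]
```

Because the second pattern requires a preceding qualifier, the bare category term "joint" would still pass; whether such category terms were kept for joints is not stated in the text.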

2.2. Source of Definitions 

We used WordNet (1.7) as the general resource (using WordNet glosses as definitions) and Dorland’s 
medical dictionary (27th Edition) as the specialized resource.

Out of the 420 anatomical terms selected, 134 were defined in WordNet and 213 in Dorland’s. The 
definitions of the 117 anatomical terms found in both sources were finally selected as the material for 
this study. 

3. Methods 

3.1. Resolving ambiguity 

Ambiguity was found in both WordNet and Dorland’s when trying to map anatomical terms to these 
resources. 

Anatomical terms were mapped to WordNet using the standard wn function. When the mapping resulted 
in multiple senses, having anatomy as a target helped select the correct sense, i.e., the synset with
“body part” in its hypernyms. In the rare cases of mapping to multiple hyponyms of “body part”, the
synset at the deepest level of the hierarchy was selected. The definitions of the few anatomical terms 
mapped to WordNet but outside the hierarchy of body parts (e.g., “intervertebral disc”) were not used in 
this study. 
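
The sense-selection rule can be illustrated on a toy hierarchy. The sketch below does not query WordNet: the HYPERNYM table is invented data with a single parent per synset (WordNet allows several), and the synset labels and function names are hypothetical.

```python
# child -> parent; a toy fragment, not actual WordNet content
HYPERNYM = {
    "body part": "entity",
    "organ": "body part",
    "heart#organ": "organ",
    "spirit": "entity",
    "heart#courage": "spirit",
}


def hypernym_chain(synset):
    """All hypernyms of `synset`, from the direct parent up to the root."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def select_sense(candidates, root="body part"):
    """Keep senses below `root`; among several, prefer the deepest one."""
    anatomical = [s for s in candidates if root in hypernym_chain(s)]
    if not anatomical:
        return None  # term is outside the hierarchy of body parts
    return max(anatomical, key=lambda s: len(hypernym_chain(s)))


chosen = select_sense(["heart#courage", "heart#organ"])
deepest = select_sense(["organ", "heart#organ"])
outside = select_sense(["heart#courage"])
```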

Like many dictionaries, Dorland’s lists definitions for the multiple senses or usages of a lexical entry as 
numbered definitional items. When an entry had multiple definitions, the correct one was selected 
manually. 

3.2. Preparing the Definitions 

The definitions of anatomical terms in WordNet are often limited to one sentence and were processed 
entirely. By contrast, Dorland’s definitions are often encyclopedic definitions. For this reason, only the
first sentence of Dorland’s definitions was considered in this study. 





3.3. Classifying the Definitions 


The definitions were analyzed manually by the two authors, using the following strategy to classify them
with respect to the kind of their definiens. The first issue was to distinguish between definition and
description as they were defined in section 1.3. Then, definitions were classified in the following three
subcategories: genus-differentia definition, definition by meronymy and definition based on a property.
Descriptions were classified in two subcategories: free-text descriptions and structured descriptions.
Definitions whose definiens did not fit any of these kinds were marked for separate analysis.

When the two authors disagreed about the classification of a definition, it was analyzed again until a 
consensus was reached. 

3.4. Analyzing the Relationship of the Definiendum to the Definiens

We used the MetaMap program developed by Aronson (2001) to map the definiens to the Unified 
Medical Language System® (UMLS®) [Lindberg et al. (1993); UMLS (2001)]. As a result, we extracted
all biomedical concepts from the definiens, allowing us to access properties such as their semantic 
category and their relationships to other concepts. In addition, we took advantage of the shallow 
syntactic analysis provided by MetaMap in order to identify the first noun phrase (or head) of the 
definiens. When the noun in the first noun phrase was “pair” (e.g., “the twelve pairs of nerves connected
with the brain”), the next noun phrase was used as the head. A similar correction was used to 
systematically prevent some adjectives from being interpreted as nouns (e.g., “longest” in “the longest 
and thickest bone of the human skeleton”). 
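
The correction for “pair” amounts to skipping certain noun phrases when choosing the head. The following sketch assumes the noun phrases have already been identified and are given as (text, head noun) tuples; this input layout is hypothetical and is not the output format of MetaMap.

```python
SKIP_NOUNS = {"pair", "pairs"}  # nouns that should not serve as the head


def head_of_definiens(noun_phrases):
    """Return the first noun phrase whose noun is not in SKIP_NOUNS.

    noun_phrases: list of (text, head_noun) in order of occurrence.
    """
    for text, noun in noun_phrases:
        if noun.lower() not in SKIP_NOUNS:
            return text
    return None


phrases = [("the twelve pairs", "pairs"), ("nerves", "nerves"),
           ("the brain", "brain")]
head = head_of_definiens(phrases)
```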

Since the definiendum comes from UWDA, which is one of the constituent vocabularies in the UMLS 
and the concepts extracted from the definiens by MetaMap are also UMLS concepts, the various kinds of 
relationships recorded in the UMLS can be exploited to compute whether medical concepts from the 
definiens (especially the head) are related to the definiendum. The following relationships were sought 
between the concepts corresponding to the definiendum and the head of the definiens: ancestor, 
descendant, sibling, other (usually associative) relationship. In addition, the relationship between the 
definiendum and the head was considered to be synonymy when the two terms mapped to the same 
UMLS concept. 
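
The relationship categories can be illustrated on a small invented hierarchy. The sketch is a simplification: PARENTS is toy data rather than UMLS content, concepts are identified by plain strings, synonymy is modelled as identity of the concept, and associative relationships are not modelled at all.

```python
# concept -> set of direct parents (toy data, not taken from the UMLS)
PARENTS = {
    "patella": {"sesamoid bone"},
    "pisiform": {"sesamoid bone"},
    "sesamoid bone": {"bone"},
    "femur": {"long bone"},
    "long bone": {"bone"},
    "bone": {"organ"},
}


def ancestors(concept):
    """All concepts reachable by following parent links upwards."""
    seen, stack = set(), list(PARENTS.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, ()))
    return seen


def relationship(definiendum, head):
    """Classify the relationship of the definiendum to the head concept."""
    if definiendum == head:
        return "synonymy"
    if head in PARENTS.get(definiendum, ()):
        return "ancestor, first-generation"
    if head in ancestors(definiendum):
        return "ancestor, other"
    if definiendum in ancestors(head):
        return "descendant"
    if PARENTS.get(definiendum, set()) & PARENTS.get(head, set()):
        return "sibling"
    return "none"
```

The order of the tests matters: hierarchical relationships are checked before the sibling test, so that a concept sharing a parent with its own ancestor is still reported as hierarchically related.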

3.5. Comparing the Two Approaches

In order to study whether there is a relationship between the two methods of characterization (class of
definition and relationship of the definiendum to the head of the definiens), we built a contingency
table to summarize the cross-classification of the definitions into these two characteristics.
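
Such a cross-classification is straightforward to compute once each definition carries both labels. The sketch below is a minimal illustration with invented records; the function name and the record layout are assumptions.

```python
from collections import Counter


def contingency(records):
    """records: iterable of (definition_class, relationship) pairs.

    Returns the cell counts together with the row totals (per relationship)
    and the column totals (per definition class).
    """
    cells = Counter(records)
    row_totals, col_totals = Counter(), Counter()
    for (cls, rel), n in cells.items():
        col_totals[cls] += n
        row_totals[rel] += n
    return cells, row_totals, col_totals


# Invented records, for illustration only.
records = [("Genus-Differentia", "ancestor"), ("Genus-Differentia", "ancestor"),
           ("Property", "none"), ("Meronymy", "ancestor")]
cells, row_totals, col_totals = contingency(records)
```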

4. Results 

4.1. Classification of the Definitions

The distribution of the definitions into the various classes introduced in Table 1 is summarized in Table 
2. While a large majority of the 234 definitions examined correspond to true definitions, some 12% of 
them are actually anatomical descriptions, structured or not. Not surprisingly, two thirds of the 
definitions follow the Aristotelian pattern of genus and differentia. 
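
The proportions quoted in this paragraph can be checked against the counts reported in Table 2. The short sketch below simply redoes that arithmetic; the variable names are chosen here for illustration.

```python
# Counts per category as reported in Table 2.
counts = {"Genus-Differentia": 155, "Meronymy": 14, "Property": 30,
          "Free-text": 13, "Structured": 14, "Other": 8}

total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}
description_share = round(
    100 * (counts["Free-text"] + counts["Structured"]) / total)
```

With these counts the genus-differentia share is 66%, i.e. about two thirds, and the two kinds of descriptions together account for about 12% of the 234 definitions.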

In eight cases, the definition did not meet any of the classification criteria. Five of these cases involved 
the definition of an adjective by Dorland’s rather than that of the corresponding noun (e.g., “pisiform: 
resembling a pea in shape and size” instead of the wrist bone called “pisiform”). Other outliers included 





one reference to a table, one reference to a synonym, and the definition of a subentry that is not valid 
outside the context of the entry (“small bone: one whose main dimensions are approximately equal”). 



Category     Subcategory          N     %
Definition   Genus-Differentia  155    66
             Meronymy            14     6
             Property            30    13
Description  Free-text           13     6
             Structured          14     6
Other                             8     3
Total                           234   100


Table 2: Categories of descriptions and definitions found for anatomical terms. 

4.2. Relationship of the Definiendum to the Definiens

The distribution of the relationship of the definiendum to the head of the definiens as defined in section 
3.4 is summarized in Table 3. In two cases corresponding to the definition of an adjective instead of that
of the wrist bone qualified by this adjective (e.g., “pisiform”), no concept could be identified by 
MetaMap from the definition. The total number of relationships studied between the definiendum and 
the head is thus 232 (out of the 234 definitions). 

Examples of synonymy between the definiendum and the definiens include “Axis: the second cervical 
vertebra” and “maxilla: the upper jawbone in vertebrates”. Although these definitions meet the criterion 
for a genus-differentia definition, the relation of the definiendum to the definiens is actually synonymy 
rather than hyponymy, the two terms being clustered into the same UMLS concept. 

Hierarchical relationships are the principal type of relationships found between the definiendum and the 
head of the definiens. 

Although descendant relationships usually denote an error in the mapping of the definiens to UMLS 
concepts, some definitions use holonymy (the inverse of meronymy) to relate the definiendum to the 
definiens, for example, in “nerve: any bundle of nerve fibers running to various organs and tissues of the 
body”. 

Finally, sibling relationships between the definiendum and the definiens either correspond to a kind of 
definition other than genus-differentia or denote some potential knowledge representation issue in the 
UMLS (e.g., although the patella is indeed “a triangular sesamoid bone”, no medical vocabulary in the 
UMLS records any hierarchical relationship between the concepts “patella” and “triangular bone”). 


Relationship                   N     %
Synonymy                       8     3
Ancestor, first-generation    82    35
Ancestor, other               48    21
Descendant                     3     1
Sibling                       19     8
Other (usually associative)    0     0
None                          72    31
Total                        232   100


Table 3: Relationship between the definiendum and the head of the definiens. 



4.3. Comparison between the two approaches 

Table 4 summarizes the cross-classification of the definitions into the two characteristics studied: class 
of definition and type of relationship of the definiendum to the head of the definiens.

Since by definition of genus-differentia definitions the genus is a broader category compared to the 
definiendum, the ancestor relationship is logically predominant in the genus-differentia definitions. 
However, the number of definiens mapping to an ancestor of the definiendum is slightly less than the 
number of genus-differentia definitions. In addition, not all hierarchical relationships in UMLS are
taxonomic; it is therefore not surprising to find that some relationships listed as ancestor actually
correspond to meronymic definitions.

For most definitions based on a property, there is usually no relationship found in the UMLS between 
the definiendum and the property. The concept used to represent the property is often general (e.g., 
“sac”, “tube”) while some concept more specific to the domain of anatomy could have been used instead 
(e.g., “saccular viscus”, “tubular viscus”). 

Much the same can be said of the descriptions, especially free-text descriptions, where the
head of the definiens is a general term (e.g., “structure”, “unit”, “mass”) not related to the definiendum.



                 ---------- Definition ---------  ---- Description ----
Relationship     Genus-diff.  Meronymy  Property  Free-text  Structured  Other  Total
Synonymy                   5         -         1          -           1      1      8
Anc., 1st gen.            77         4         1          -           -      -     82
Anc., other               46         -         1          1           -      -     48
Descendant                 1         2         -          -           -      -      3
Sibling                   11         1         5          1           1      -     19
None                      15         7        22         11          12      5     72
Total                    155        14        30         13          14      6    232


Table 4: Cross-classification of the definitions into the two characteristics studied. 
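
The marginal totals of Table 4 and the percentages of Table 3 can be rechecked with a few lines of code. The sketch below is only an illustration: it transcribes the counts of Table 4, reads blank cells as zero, and uses function names (row_totals, column_totals, share) that are our own and not part of the study.

```python
# Counts transcribed from Table 4; blank cells are read as zero (an assumption).
COLUMNS = ["Genus-differentia", "Meronymy", "Property",
           "Free-text", "Structured", "Other"]

TABLE4 = {
    "Synonymy":                   [5, 0, 1, 0, 1, 1],
    "Ancestor, first-generation": [77, 4, 1, 0, 0, 0],
    "Ancestor, other":            [46, 0, 1, 1, 0, 0],
    "Descendant":                 [1, 2, 0, 0, 0, 0],
    "Sibling":                    [11, 1, 5, 1, 1, 0],
    "None":                       [15, 7, 22, 11, 12, 5],
}


def row_totals(table):
    """Number of definitions per relationship (last column of Table 4)."""
    return {name: sum(cells) for name, cells in table.items()}


def column_totals(table, columns):
    """Number of definitions per class of definition (last row of Table 4)."""
    return {col: sum(cells[i] for cells in table.values())
            for i, col in enumerate(columns)}


def share(count, total):
    """Percentage rounded to the nearest integer, as reported in Table 3."""
    return round(100 * count / total)


if __name__ == "__main__":
    rows = row_totals(TABLE4)
    grand_total = sum(rows.values())
    for name, n in rows.items():
        print(f"{name:28s}{n:5d}{share(n, grand_total):5d}%")
    ancestors = rows["Ancestor, first-generation"] + rows["Ancestor, other"]
    genus = column_totals(TABLE4, COLUMNS)["Genus-differentia"]
    print("ancestors:", ancestors, "genus-differentia:", genus)
```

Run as a script, it prints each relationship with its count and rounded share, followed by the two figures compared in the text (ancestor mappings versus genus-differentia definitions).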

5. Discussion 

5.1. General versus Specialized Resources

Not only can different characteristics of the definitions be compared with each other; the
characteristics can also be used to profile a source of definitions or to compare several sources.
For example, Table 5 shows the classification of the definitions reported in section 4.1, analyzed
separately for each source.

Although the number of definitions is too small and the domain too limited to draw any definitive 
conclusion, it is remarkable that, for example, WordNet actually has some structured technical 
descriptions (e.g., “large intestine: beginning with the cecum and ending with the rectum; includes the 
cecum and the colon and the rectum; extracts moisture from food residues which are later excreted as 
feces”). 


This study also confirms the predominance of genus-differentia definitions in both general and
specialized resources, although anatomical descriptions are more often found in Dorland’s than in
WordNet.



Category      Type                 WordNet   Dorland’s
Definition    Genus-differentia        75%         57%
              Meronymy                 11%         15%
              Property                  7%          5%
Description   Free-text                 3%          8%
              Structured                3%          9%
Other                                   0%          7%
Total                                 100%        100%


Table 5: Categories of descriptions and definitions in two different sources. 

5.2. Ontological Perspective 

In some cases, the definitions in both systems are different predicates that correspond to equivalent sets 
of objects. For example, gland may be defined as an “aggregation of cells, specialized to secrete or 
excrete materials not related to their ordinary metabolic needs”. In other cases, however, the definitions
in both systems correspond to different sets of objects. For example, “salivary glands” in Dorland’s 
include the three major glands (parotid, sublingual, and submandibular), as well as numerous small 
glands in the tongue, lips, cheeks, and palate. By contrast, in WordNet, salivary glands are “the three 
pairs of glands...”, implicitly the major ones, thus virtually excluding the minor salivary glands. In this 
example, the term in Dorland’s is generic while the term in WordNet actually corresponds to “major 
salivary glands”. 

Conclusions 

Characteristics of the definitions of terms, especially from several sources, represent valuable 
information. Among other things, this study confirmed the predominance of genus-differentia definitions 
for anatomical terms in both WordNet and specialized resources. This knowledge is expected to help 
perform the deeper analysis needed for representing the definitions in a formalism suitable to their 
comparison. 

ACKNOWLEDGMENTS 

The authors would like to thank Tom Rindflesch for his advice and useful comments. 

References 

Aronson, A. R. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap 
program. Proceedings of AMIA Annual Symposium, 17-21. 

Burgun, A., and Bodenreider, O. (2001) Comparing terms, concepts and semantic classes in WordNet 
and the Unified Medical Language System. Proc NAACL Workshop, "WordNet and Other Lexical 
Resources: Applications, Extensions and Customizations", 77-82. 

Fellbaum, C. (1998) WordNet: an electronic lexical database (Cambridge, Mass, MIT Press). 

Harabagiu, S. M., and Moldovan, D. I. (1998) Knowledge processing on an extended WordNet. In 
"WordNet: An Electronic Lexical Database", C. Fellbaum, ed. (Cambridge, Massachusetts, MIT Press), 
pp. 379-405. 


Klavans, J. L., and Muresan, S. (2000) DEFINDER: Rule-based methods for the extraction of medical
terminology and their associated definitions from on-line text. Proceedings of AMIA Annual
Symposium, 1049.

Lindberg, D. A., Humphreys, B. L., and McCray, A. T. (1993) The Unified Medical Language System. 
Methods Inf Med 32, 281-291. 

Rosse, C., Mejino, J. L., Modayur, B. R., Jakobovits, R., Hinshaw, K. P., and Brinkley, J. F. (1998) 
Motivation and organizational principles for anatomical knowledge representation: the digital 
anatomist symbolic knowledge base. J Am Med Inform Assoc 5, 17-40. 

Shaikevich, A. Y. (1985) Automatic Construction of a Thesaurus from Explanatory Dictionaries. 
Automatic Documentation and Mathematical Linguistics 19, 76-89. 

Swartz, N. (1997) Definitions, dictionaries and meanings. http://www.sfu.ca/philosophy/swartz/
definitions.htm [Dec. 1, 2001].

UMLS (2001). UMLS Knowledge Sources, 12th edn (Bethesda (MD), National Library of Medicine). 


Olivier Bodenreider, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, 20894, olivier@nlm.nih.gov
Anita Burgun, Laboratoire d’Informatique Médicale, Avenue du Pr Léon Bernard, 35043 Rennes Cedex, France,
Anita.Burgun@univ-rennes1.fr



EuroWordNet as a Resource for Learning Spanish Verbs 


Roser Morante 
M. Antònia Martí


Abstract 

In this paper we present ongoing research on applying EuroWordNet (EWN) as a tool for learning
Spanish verbs in a computer-assisted language learning (CALL) context. Current tendencies in L2
acquisition research encourage semantically based approaches to vocabulary teaching, which means
that words have to be learned starting from their meaning. We present a didactic proposal for using
EWN for learning and teaching purposes, based on the hypothesis that the process of lexicon
acquisition can be facilitated by means of instruction. An instructional method should focus on
helping the learner to establish lexical relations between the L1 and the L2, to connect words
semantically in the L2 lexicon, and to learn how to use words in context. Because EWN is a
multilingual semantic net, it is a potential resource to be integrated in a system for CALL purposes.
However, it lacks information about the use of words. In order to improve EWN as a learning tool, it
has been enriched with links to a diathesis alternations database and to a semantically tagged corpus.

1. Introduction

Current tendencies in L2 acquisition research encourage semantically based approaches to vocabulary
teaching, which means that words have to be learned starting from their meaning (Carter & McCarthy
1988, Nation 1990). A general review of the research carried out in the field of L2 acquisition shows
that the acquisition of the lexicon plays a very important role in the acquisition of a language. As
Bogaards (1996:369) puts it: "It is time that language teaching and language acquisition research
carefully consider an approach that stems from the lexicon".

In this paper we present a didactic application of EWN as a tool for learning Spanish verbs in a CALL
context. Following Bogaards (1996), we assume the hypothesis of Transfer in vocabulary acquisition:
learners have a semantic memory of the L1, whose knowledge can be represented as a net of
concepts, most of which are directly related to lexical items in the L1. The lexical items of the L2 can
only be understood through the already existing knowledge of the L1. We consider that EWN can be
taken as a model for representing this kind of lexical-semantic memory.

As for the process of acquisition of the lexicon, we suggest that it can be facilitated by means of an
instructional method. The method should focus on helping the learner to establish relations between
words in the L1 and L2 lexicons, and relations between words within the L2 lexicon. This method
assumes that the learner has to deal with linguistic as well as metalinguistic knowledge (semantic
relationships) in order to develop her own lexical-conceptual framework for all senses of a word in the L2.

There is agreement that acquiring the lexicon implies a continuous conceptual development of
non-stable meanings. The acquisition of the lexicon consists in strengthening the
lexical relations in the L2 mental lexicon. In order to do that, the learner has to confront the inner
structure of the lexicon. Thus it is necessary to help learners and teachers become familiar with it.

Consequently, findings of lexical semantics about semantic relations, polysemy, multiword
phenomena, multilingual relationships and analysis of word senses play a central role in the teaching
of the lexicon. Given that EWN fits the guidelines defined in vocabulary acquisition research
over recent years, we propose that EWN can be used not only as a model of semantic memory, but
also as a resource for facilitating the acquisition of the learner's own lexicon.

Learning verbs also means learning their syntactic properties in accordance with their meaning. Since
EWN lacks detailed information about this kind of linguistic knowledge, the Spanish verbs in EWN
are being enriched in several ways. First, we associate with each variant a set of diathesis alternations,
following the proposal in Vázquez et al. (2000). Second, we provide each variant with a link to its
occurrences in a semantically tagged corpus. In the corpus, verbs are manually tagged with the number
of the corresponding synset in WN. Finally, each synset is enriched with a Spanish definition¹. This
information improves the use of EWN for didactic purposes.

There are no similar resources for teaching either the semantics of Spanish verbs or their
syntactic combinatorial behavior. Consequently, this application of EWN can be very useful and
innovative.

2. What does it mean to learn a verb? 

First of all we address the question of what it means to learn a word. Summarizing
the proposals in Richards (1976), Carter & McCarthy (1988), Swan (1997) and Laufer (1997), learning a
word means learning:

• Its form: orthography, pronunciation and structure of the word. 

• Its meaning: limits of its reference, its different meanings and its connotations. 

• Its grammatical characteristics: the syntactic category and the syntactic structures that the
word accepts.

• Its use: the contexts in which the word appears, its collocations and its variants of register. 

• Its place in the lexicon: the relation it establishes with other words. 

Researchers agree that the process of vocabulary learning is not only quantitative, but above all
qualitative, and that a very important source of difficulty is the learning of the semantics of words and
their combinatorial properties, that is to say, their idiomatic use.

Carter and McCarthy (1988) consider that learning vocabulary effectively is closely bound up with a
teacher's understanding, and a learner's perception, of the difficulties of words. For them the difficulty
of a word may result, among other things, from the relations it can be seen to contract with other words,
from its polysemy, the associations it creates and, in the case of advanced learners, from the nature of
the contexts in which it is encountered. Ellis (1997) states that speaking natively is speaking
idiomatically using frequent and familiar collocations, and the job of the language learner is to learn
these familiar word sequences.

As for the question of what it means to learn a verb, the most complex aspect dealt with in research
is the acquisition of argument structure (Juffs 1996). Learning a verb means learning its
semantics-syntax correspondences for all of its senses. Studies in lexical semantics have suggested that the
syntactic behavior of a verb is to a certain extent determined by its meaning (Levin 1993, Pinker
1989). Thus, learning the meaning of the verb becomes a very important reference for learning its use.

In sum, though learning a word involves learning several types of information, the semantic aspect
has become a crucial one. Consequently, an application is needed that contains large amounts of
semantic information about words. Since EWN is a multilingual lexical database, it seems to be an
adequate resource to apply to language teaching and learning.

3. EuroWordNet

EWN (Vossen 1998) is a set of integrated and interconnected lexical-semantic nets for several
languages, whose design is based on WordNet (WN) 1.5 (Miller et al. 1990). The Spanish EWN
contains 63,000 nominal, verbal and adjectival concepts, all of them connected to WordNet 1.5
synsets. All concepts are classified in terms of a language-independent conceptual structure, the
Inter-Lingual Index (ILI).

1 This work is being developed in the framework of the Senseval project, in which the CLiC research
group participates as a provider.





3.1. WN vs. EWN 


The notion of synset and the main semantic relations have been kept, although the set of relations has
been considerably extended. An in-depth description of EWN can be found in Vossen (1998). Below
we summarize the distinguishing characteristics of EWN:

1) The main difference is multilinguality. It is expressed through a relation of equivalence
(eq_synonym) between the meaning of the synsets of each language and those of WN 1.5.

2) Lexicality. 30% of the nominal entries in WN 1.5 are not lexical, but conceptual. In contrast,
EWN is mainly a lexical database, so that each wordnet in EWN shows the idiosyncratic
characteristics of its language.

3) Modularity. From a language teaching perspective it can be considered as a very rich lexical 
database, since: 

• It is multilingual, with English acting as an interlingua, since it is the language used in the 
glosses of the synsets. Learners knowing English can access the relations between concepts 
and words of all the languages. 

• The gloss of the synset allows the learner to understand the conceptual information that the 
word expresses. 

• The information encoded in the semantic relations allows the learner to learn how words relate
to each other.

• The fact that it is concept-based allows the learner to realize that not all languages have words
for the same concept and that each language codifies the world in a different manner.
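
The role of the equivalence links can be illustrated with a small sketch. It is not the EWN database format: the Synset record, the ILI identifiers and the sample entries below are assumptions made for the illustration (the English glosses for llevar and dar pena are those given elsewhere in this paper). Words of two languages are treated as equivalent when their synsets point to the same ILI record.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Synset:
    language: str               # language code, e.g. "es" or "en"
    variants: Tuple[str, ...]   # word forms grouped in the synset
    ili: str                    # id of the linked ILI record (made up here)


# Hypothetical toy sample; the identifiers are invented for this sketch.
SYNSETS = [
    Synset("es", ("llevar",), "ili-0001"),
    Synset("en", ("take", "carry"), "ili-0001"),
    Synset("es", ("dar pena",), "ili-0002"),
    Synset("en", ("grieve",), "ili-0002"),
    # Left without an English counterpart in this toy sample on purpose.
    Synset("es", ("trajinar",), "ili-0003"),
]


def equivalents(word: str, source: str, target: str,
                synsets: List[Synset] = SYNSETS) -> List[str]:
    """Return target-language words whose synset shares an ILI record
    with any source-language synset containing `word`."""
    ili_ids = {s.ili for s in synsets
               if s.language == source and word in s.variants}
    words = {v for s in synsets
             if s.language == target and s.ili in ili_ids
             for v in s.variants}
    return sorted(words)


if __name__ == "__main__":
    print(equivalents("llevar", "es", "en"))
    print(equivalents("trajinar", "es", "en"))
```

An empty result, as for the last sample entry, is how a learner-facing interface could signal that the target language has no word linked to the same concept in the data at hand.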

3.2. Enriching EWN

Though EWN is a rich lexical database, in its original state it contained little information about
the use of words and their syntactic behavior. The Spanish WN has already been partially enriched in
the framework of the Senseval project (Kilgarriff & Palmer 2000). As providers to the project, we
have carried out an extension, which consists in linking the Spanish WN to a database containing the
syntactic alternations of verbs and to a semantically tagged corpus.

Concerning diathesis alternations, the database contains the alternations accepted by each
sense of a group of Spanish verbs. The information in the database is based on an in-depth study
of diathesis alternations in Spanish (Vázquez et al. 2000). For each verb, information is given
about the semantic class of the entry, the broad subcategorization structure, the diatheses it accepts,
and the syntagmatic structure of each diathesis:


Dar pena

subcategorization: SN V SN SP

example: 'La presentadora nos dio pena por su alto grado de candidez'
(Eng. 'We felt sorry for the presenter because of her great naivety')

diatheses:

d1: SN pron V (SP) SP
k1: 'se' V SN SP
s1: SN V[dejar + ppio/adj] SN
t1: SN V[hacer + inf] SN / SN hacer oc
u4: SN V
c5: SN 'se' V SP
d5: SN pron V (SP) SP
r5: SN V[estar + ppio/adj/adv] SP

English: grieve

Catalan: fer llàstima

Figure 1: Database entry for dar pena.





Although learners cannot access this information directly, since they would not understand it, having
it available is useful for preparing didactic materials and exercises.

Having a corpus available is highly important since, as Nagy (1997) points out, it is in a specific
context that words acquire a meaning, and it is by finding words in different contexts that the learner
will be able to develop lexical competence. In the corpus that we use, LEXESP (Sebastián et al. 2000),
verbs have been semantically disambiguated. The disambiguation consists in manually assigning the
synset number to the verb. The fact that verbs are linked to a synset in EWN allows the learner to
look for occurrences of verbs in a specific sense.


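actuar#VM#1#Ponerse en acción: hay que actuar#SIN:?#01341700v#
(Eng. 'To go into action: one has to act')

actuar#VM#2#Comportarse, obrar o ejercer un rol: ha actuado bien; desde la
muerte de su padre, el hermano mayor actuó como cabeza de familia#SIN:?#00007021v#
(Eng. 'To behave, to act or to play a role: he has acted well; since his father's
death, the elder brother has acted as head of the family')

actuar#VM#3#Representar o interpretar un papel en una obra, un tema en un
espectáculo musical: los actores han actuado de manera brillante#SIN:?#00983780v#
(Eng. 'To perform or play a part in a play, or a number in a musical show: the
actors performed brilliantly')

actuar#VM#4#Ejercer funciones propias de su cargo: el abogado actuó de
defensor#SIN:?#00618376v#
(Eng. 'To carry out the functions of one's office: the lawyer acted as defense counsel')

actuar#VM#5#Producir efectos: la medicina actuó rápidamente#SIN:?#?#
(Eng. 'To produce effects: the medicine acted quickly')

actuar#VM#6#Jugar en un equipo: los futbolistas franceses actuaron en el
Camp Nou#SIN:?#?#
(Eng. 'To play in a team: the French footballers played at the Camp Nou')


Figure 2: Spanish WN senses for actuar.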


En la misma situación que el goleador azulgrana se
encuentran la mayoría de jugadores suramericanos que
*actúan* en España . #6

Hace tiempo que no *actuaba* en un tablao y tenía
muchas ganas, puesto_que bailar en un sitio así te
permite improvisar y ganar en frescura , ya que
cuando haces un montaje para una sola actuación está
todo atado y ensayado . #3

Cardenal *actuará* contra quienes dijeron que la
policía asesinó a Geresta. #4

Figure 3: Sample of semantically disambiguated occurrences in the corpus.
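
The link between sense records and corpus occurrences can be sketched as follows. The field layout is inferred from Figures 2 and 3 only (lemma, category, sense number, definition with example, synonym field, synset number, separated by '#'; corpus lines ending in '#' plus a sense number); the class and function names are our own, and the actual tools of the project may work differently.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Sense(NamedTuple):
    lemma: str
    category: str
    number: int
    gloss: str              # definition and example(s), as in Figure 2
    synset: Optional[str]   # None when the record shows '?'


def parse_sense(record: str) -> Sense:
    """Split one '#'-separated sense record into its fields."""
    fields = record.strip().rstrip("#").split("#")
    lemma, category, number, gloss, _synonyms, synset = fields
    return Sense(lemma, category, int(number), gloss,
                 None if synset == "?" else synset)


# A corpus line is the sentence followed by '#' and the sense number.
OCCURRENCE = re.compile(r"^(?P<text>.*?)\s*#(?P<sense>\d+)\s*$", re.DOTALL)


def parse_occurrence(line: str) -> Tuple[str, int]:
    match = OCCURRENCE.match(line)
    if match is None:
        raise ValueError(f"no sense tag found in: {line!r}")
    return match.group("text"), int(match.group("sense"))


def occurrences_for(sense_number: int, lines: List[str]) -> List[str]:
    """Return the corpus sentences tagged with the given sense number."""
    parsed = (parse_occurrence(line) for line in lines)
    return [text for text, number in parsed if number == sense_number]


RECORDS = [
    "actuar#VM#4#Ejercer funciones propias de su cargo: "
    "el abogado actuó de defensor#SIN:?#00618376v#",
    "actuar#VM#6#Jugar en un equipo: los futbolistas franceses "
    "actuaron en el Camp Nou#SIN:?#?#",
]

CORPUS = [
    "Cardenal *actuará* contra quienes dijeron que la "
    "policía asesinó a Geresta. #4",
    "En la misma situación que el goleador azulgrana se encuentran la "
    "mayoría de jugadores suramericanos que *actúan* en España . #6",
]

if __name__ == "__main__":
    for record in RECORDS:
        sense = parse_sense(record)
        print(sense.number, sense.synset, occurrences_for(sense.number, CORPUS))
```

Keeping the sense number as the join key mirrors the way the corpus samples in Figure 3 refer back to the numbered senses of Figure 2.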


4. Exploiting EWN as a learning tool 

We assume that learning the meaning of words consists mainly in establishing hypotheses about their
meaning and testing them against several kinds of knowledge by comparing new and already known
information. Following Appel (1996), the semantic development of the L2 is a process that starts with
a projection of the meanings of the L1 onto meanings in the L2 and continues with the development of
the meaning structures of the L2. The final result would be two closely interrelated systems: links
between L1 and L2, and L2 intrarelations. As the consolidation of the lexicon progresses,
semantic factors become more important in the lexical organization.

Our hypothesis is that the process of acquisition of the lexicon can be assisted by means of an
instructional method. From their comparison of several methods of instruction, Stahl and Fairbanks
(1986) concluded that:

a. Vocabulary instruction is a useful complement to learning in a naturalistic context.

b. The methods that produced the best effects on vocabulary comprehension were those that
required definitional as well as contextual information.

c. The methods that provided a variety of knowledge about each word in several contexts had a very
good effect on the later comprehension of texts in which those words appeared.

We assume that the information contained in EWN can help in the process of capturing word
semantics. The enrichment of EWN with syntactic and corpus information will fill the gap between
lexical knowledge and syntactic as well as pragmatic knowledge.

We propose a method based on the manipulation of linguistic and metalinguistic knowledge in order 
to raise consciousness about the fact that the lexicon is highly structured and that, at the same time, 
each word has a specific meaning and specific combinatorial requirements. The method has two main 
purposes. First, that the learner develops a conceptual framework for each sense of a word. Second, 
that the learner is able to use the words in context helped by her knowledge about the semantics of the 
word and her consciousness about multiword phenomena. 

In our project a didactic guide to EWN is being developed, which will be provided to the learner as a
means of approaching EWN and monitoring the process of vocabulary learning. The guide proposes
exercises so that the learner understands and assimilates the information that she accesses. We
are also designing an interface to allow user-friendly consultation of EWN.

The interface will display all the information related to a word. Three levels of knowledge
(beginners, intermediate and advanced) will be defined. Each verb sense will be assigned to a level of
knowledge. The distribution of the senses by levels of knowledge is carried out by applying several
criteria:

• Semantic specialisation: more specialised senses of verbs are assigned to higher learning levels. 
Specialisation is determined by the selection restrictions of the verbs and the meaning components 
they lexicalize. 


Level          Verbs of transfer
Beginners      Llevar (Eng. 'to take, to carry')
Intermediate   Transportar (Eng. 'to transport'), trasladar (Eng. 'to transfer')
Advanced       Acarrear (Eng. 'to haul'), trajinar (Eng. 'to cart'), arrastrar (Eng. 'to drag')


• Prototypicality: according to Kellerman (1986), basic meanings of verbs are learned first, while
metaphoric extensions are developed at a later stage. Some kind of basic kernel meaning is learned
first, and other secondary meanings are learned at a later stage of development, as are
associations, the uses of a word with a specific meaning in collocations, and fixed expressions. Thus
prototypical senses of words are assigned to the beginners level, while meaning extensions and
metaphorical uses are assigned to higher levels.






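Level          Senses of verb llevar
Beginners      Sense of transfer:
               Ayer llevé las fotografías a la editorial
               (Eng. 'Yesterday I took the photographs to the publisher')
Intermediate   Extensions of sense of transfer:
               La compañía de teatro ha llevado su obra por toda España
               (Eng. 'The theatre company has taken its play all over Spain')
               El director de cine ha llevado la historia de su madre a las pantallas
               (Eng. 'The film director has brought his mother's story to the screen')
               Esta empresa nos va a llevar a la ruina
               (Eng. 'This company is going to lead us to ruin')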


• Syntactic complexity: uses of the verb that require complex syntactic structures are assigned to
higher levels.


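Level          Senses of verb llevar
Beginners      Prototypical alternation of transfer:
               Ayer llevé las fotografías a la editorial
               (Eng. 'Yesterday I took the photographs to the publisher')
Intermediate   Dative alternation:
               Nos llevó mucho tiempo hacer los ejercicios
               (Eng. 'It took us a long time to do the exercises')
Advanced       Causative alternation + infinitive clause:
               Las palabras de la ministra de educación me llevan a pensar que no
               modificarán la ley universitaria
               (Eng. 'The words of the education minister lead me to think that they
               will not modify the university law')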


In order to validate these criteria, experiments have already been carried out (Morante & Diaz 1999).
The experiments showed that factors such as selection restrictions, degree of lexicalization and
grammatical characteristics affect the degree of difficulty that verb senses pose to learners.
Saliency has proved to be a main factor in determining the learning of verb senses. It explains
the ease of learning some of the prototypical uses (those that are frequent), but also some
lexicalized expressions. This could explain why some nonprototypical uses of the verb llevar, for
example those of possession, were easily decoded by learners.

In the development of the lexicon each lexical unit has its own history of processing. We propose two
stages of learning, whose steps are reflected in the didactic guide for approaching the Spanish
WN:

1. Development of the prototypical senses of words and basic relations. This stage is important,
since the other senses are developed starting from the prototypical sense of the word. It is carried
out by means of plain queries to get the hypernyms, hyponyms, synonyms and antonyms of a verb
sense, the semantic class and the definition of each synset. The learner is encouraged to compare the
same type of information for verbs in the L1 and verbs in the L2 and to draw conclusions about how
small portions of the conceptual world are encoded.

1.1. First contact with the word. In this stage the most prototypical meaning is learned. To do so, the
learner establishes correspondences between the word in the L2 and a word in the L1. The gloss in
English helps the learner to understand the concept expressed by that word.






1.2. Basic relations with other words. The learner investigates which other words are related to this 
one by means of the basic relations of antonymy, synonymy and hyponymy. 

1.3. Basic use and frequent expressions related to that word. The learner is encouraged to access
the examples of use in the corpus, so that she becomes familiar with the basic syntactic structure in
which it appears and with some frequent expressions. The learner is encouraged to approach the data
in a creative and reflective way, by means of exercises.

2. Expansion of the meaning of the word. The learner accesses the extensions of meaning of the word.
Once the form and the prototypical use of the word are consolidated, other meanings will be learned.
At this stage the learner not only establishes relations between words in the L1 and in the L2, but also
between words and concepts in the L2.

2.1. Comparison of the prototypical meaning of the word with the new meaning being learned. 
The learner is encouraged to consolidate the conceptual framework of the verb by comparing the 
conceptual content of all the senses. 

2.2. Construction of hierarchies of verbs by means of the relations of hyponymy and
hypernymy. The learner builds by hand her own WN in the L2 for a reduced number of verbs. This
allows her to get acquainted with the structure of the lexicon. Again, once the learner has built the
hierarchy in the L2 starting from a verb, she compares it with the hierarchy in the L1 starting from the
same verb.

2.3. Extension of the syntactic knowledge for one verb. Once the learner has a broad semantic
knowledge, she is guided through the occurrences of the verbs in the corpus, so that she herself
discovers the syntactic behavior of verbs. Since the learner already knows different meanings, she is
invited to group occurrences of a corpus by senses and to compare their different syntactic behavior.
This is done through a set of exercises, which gradually introduce the learner to the analysis of
linguistic data.

2.4. Extension of the syntactic knowledge for a class of verbs. Once the learner has learned several 
verbs and she is familiar with their basic syntactic behavior, she is invited to compare them in order to 
detect differences and similarities. 

Throughout all the stages EWN serves as a source in which the information has to be
discovered gradually and creatively, since it requires an act of searching. While searching, the
learner is unconsciously processing information that will help her develop her lexicon in the L2.
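
The hierarchy-building exercise of step 2.2 can be pictured with a small sketch. The hypernym links below are hypothetical: they are not taken from the Spanish WN and merely arrange the verbs of transfer mentioned in this paper into a plausible toy hierarchy; the function names are likewise our own.

```python
from collections import defaultdict

# Hypothetical hypernym links (verb -> more general verb), for illustration only.
HYPERNYM = {
    "transportar": "llevar",
    "trasladar": "llevar",
    "arrastrar": "llevar",
    "acarrear": "transportar",
    "trajinar": "transportar",
}


def hypernym_chain(verb, hypernym=HYPERNYM):
    """Follow hypernym links upwards from `verb` to the top of the hierarchy."""
    chain = [verb]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def hyponym_index(hypernym=HYPERNYM):
    """Invert the hypernym links: each verb maps to its sorted direct hyponyms."""
    children = defaultdict(list)
    for child, parent in hypernym.items():
        children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def render(root, children, depth=0):
    """Return the hierarchy below `root` as indented lines, one verb per line."""
    lines = ["  " * depth + root]
    for child in children.get(root, []):
        lines.extend(render(child, children, depth + 1))
    return lines


if __name__ == "__main__":
    print(" > ".join(hypernym_chain("acarrear")))
    print("\n".join(render("llevar", hyponym_index())))
```

The same functions could be applied to a second dictionary holding the learner's L1 hierarchy, so that the two indented trees can be compared side by side.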

5. Conclusions and Future Research 

In this paper we have presented an application of EWN as a tool for learning vocabulary. Specifically,
it will be used for learning Spanish verbs. To date there are no similar resources for learning
Spanish. The main advantage of this application is that it combines an electronic source of data with
research in second language acquisition and research on the lexical semantics of verbs.

The application has been designed taking into account the current tendencies in vocabulary teaching, 
and starting from a hypothesis about the process of vocabulary acquisition (Hypothesis of Transfer). 
The modelling of the access to EWN by levels of knowledge is supported by experiments with L2 
learners, in order to define factors of difficulty in the learning of verbs. 

At the present stage of the project we are enriching the Spanish WN with more senses of verbs. In the
initial stage of development we have dealt with high-frequency and basic vocabulary verbs (like 'put',
'give', 'go', etc.), which have many senses and many idiomatic uses. We are also developing the
interface for access to EWN and the guide for the learner. Once these tasks have been finished, a
process of testing will be carried out.





References 


R. Appel. (1996) The lexicon in second language acquisition. In: P. Jordens & J. Lalleman, Ed., 
Investigating Second Language Acquisition , Mouton de Gruyter, New York, 1996, 381-403. 

P. Bogaards. (1996) Lexicon and grammar in second language learning. In: P. Jordens & J. Lalleman, 
Ed., Investigating Second Language Acquisition. Mouton de Gruyter, Berlin, New York, 1996. 

R. Carter & M. McCarthy (Ed.) (1988) Vocabulary and Language Teaching. London, Longman, 1988. 

R. Ellis. (1997) SLA Research and Language Teaching , Oxford University Press, Oxford, 1997. 

A. Juffs. (1996) Learnability and the lexicon: theories and second language acquisition research, John
Benjamins, Philadelphia, 1996.

E. Kellerman. (1986) An eye for an eye: Crosslinguistic constraints on the development of the L2 
lexicon. In: E. Kellerman and M. Sharwood Smith (eds.) Crosslinguistic influences in second language 
acquisition. Pergamon, New York, 1986, 35-47. 

A. Kilgarriff & M. Palmer. (2000) Special Issue of the Journal Computers and the Humanities (34(1-2)),
2000.

B. Laufer. (1997) What's in a word that makes it hard or easy: some intralexical factors that affect the
learning of words. In N. Schmitt & M. McCarthy, Eds., Vocabulary: Description, Acquisition and
Pedagogy. CUP, Cambridge, 1997.

B. Levin. (1993) English Verb Classes and Alternations. University of Chicago Press, Chicago, 1993. 

G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller. (1990) Five Papers on WordNet. CSL Report
43. Cognitive Science Laboratory. Princeton University, 1990.

R. Morante & L. Diaz. (1999) The acquisition of semantics-syntax correspondences: llevar as a verb of 
transfer in Spanish. Communication presented at the EUROSLA'99 Conference. University of Lund, 
1999. 

W. Nagy. (1997) On the role of context in first- and second-language vocabulary learning. In N.
Schmitt & M. McCarthy, Ed., Vocabulary: Description, Acquisition and Pedagogy, CUP, Cambridge,
1997.

P. Nation. (1990) Teaching and learning vocabulary. Heinle & Heinle, Boston, MA., 1990.

S. Pinker. (1989) Learnability and Cognition: the Acquisition of Argument Structure, MIT Press,
Cambridge, MA, 1989.

J. Richards. (1976) The role of vocabulary teaching, TESOL Quarterly 10 (1), 77-89, 1976. 

N. Sebastián, A. Martí, M. F. Carreiras, F. Cuetos Gómez. (2000) Lexesp, léxico informatizado del
español. Edicions de la Universitat de Barcelona, 2000.

S.A. Stahl & M.M. Fairbanks. (1986) The effects of vocabulary instruction: A model-based
meta-analysis. Review of Educational Research 56: 72-110.

M. Swan. (1997) The influence of the mother tongue on second language vocabulary acquisition and 
use. In: N. Schmitt & M. McCarthy (eds.) (1997) Vocabulary: Description, Acquisition and Pedagogy. 
Cambridge: CUP, 156-180. 

G. Vázquez, A. Fernández & M.A. Martí. (2000) Clasificación verbal. Alternancias de diátesis,
Quaderns de Sintagma, 3, Universitat de Lleida, 2000.

P. Vossen, Ed. (1998) EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer 
Academic Publishers, Dordrecht, 1998. 


Roser Morante, University of Barcelona, Department of Linguistics, morante@lingua.fil.ub.es
M. Antònia Martí, University of Barcelona, Department of Linguistics, amarti@lingua.fil.ub.es





The WordNet as a Vocabulary Management Tool For 
Indexing Language 


Hemalata Iyer 
B.A.Sharada 


Abstract 

This paper reports an exploratory study of the ways in which WordNet can be used in conjunction
with indexing languages, such as faceted classification schemes and thesauri, to enhance
information search and retrieval. It compares the features of indexing languages and natural language.
It suggests applying WordNet to broaden and narrow down searches, to provide a wider map of
knowledge by linking domain thesauri, to update faceted schemes, to support full-text searches,
and as a pre-search tool for lay users.

Introduction 

The present day information environment is characterized by globalization, the breaking down of 
disciplinary boundaries, interdisciplinary and multidisciplinary research, accessibility of full text 
documents and information in other formats to the public in general via the internet, etc. The 
information handling tools such as classification schemes and thesauri are widely used for organizing 
the content of documents in information retrieval systems represented by index strings. It is herein 
that the WordNet comes into play. This paper explores the ways in which the WordNet can be used in 
conjunction with classification schemes and thesauri to enhance retrieval (Iyer and Sharada 2002). 

1. WordNet for Information Search and Retrieval 

The following are some of the ways in which the WordNet can be used to enhance information 
search and retrieval. 

• Enhance Classification schemes and update them with hierarchic terms. 

• Provide a wider map of subject for the user. Different senses of a term listed in the WordNet can 
be used to map into other disciplinary thesauri. 

• As a pre-search tool to aid understanding and identification of concepts, and the terminology 
employed in domain specific bibliographic and visual resources databases. 

• Use hypernyms to broaden and hyponyms to narrow down the search. 

• Use for full text searching. 

2. Approach to the Study 

For this exploratory study, titles drawn from the ERIC database, 2000-2001, were facet analyzed into 
PMEST structure using the Colon Classification (CC), seventh edition (Ranganathan, 1989). This 
scheme, based on the Generalized Facet Structure proposed by Dr. S.R. Ranganathan, centers on 
the idea of five fundamental categories - Personality (P), Matter (M), Energy (E), Space (S) and Time 
(T) - that are semantic universals to which all concepts in any domain can be reduced. The sample 
concepts for which no match was found in the CC were checked in the WordNet. Out of 1470 words, 
1030 matched the CC schedule. To update the schedule, the remaining 440 words had to be 
incorporated. 
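
As a quick check of the figures above, the share of the sample already covered by the CC schedule can be computed directly. The variable names are illustrative and are not taken from the study.

```python
# Figures reported in the text for the ERIC title sample.
total_words = 1470
matched_in_cc = 1030

# Words with no match in the CC schedule, which had to be incorporated.
unmatched = total_words - matched_in_cc

# Share of the sample already covered by the schedule, in percent.
coverage_pct = round(100 * matched_in_cc / total_words, 1)

print(unmatched, coverage_pct)
```

Roughly seven words in ten were already present in the schedule, so the update concerns a bit under a third of the sample.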

3. Comparison Between NL and IL 

The facet analysis and grammatical analysis of the sample titles revealed similarities and differences 
between natural language (NL) and indexing language (IL). Table 1 presents their comparison 
[Annexure 1] and Table 2 presents an example of the analysis [Annexure 2]. 



4. Uses of WordNet in IL 


4.1. The WordNet to Enhance Classification Scheme 

WordNet can be used both to update the scheme, and to enhance its index for improved access. The 
latter is especially relevant in the digital environment. 

Steps for Introducing Terms in the CC Schedule: 

a. Convert the NL terms into conceptual representation. The concepts must be in the nominative case, 
singular number and, as far as possible, in single term representation. The concept representation will 
be in controlled vocabulary. 

b. Determine the fundamental category (facet) to which the concept belongs. 

c. Identify the right location within the hierarchically organized terms in the facet schedule. While 
doing so, apply the theoretical principles, such as, the characteristics of division, their sequencing, 
the citation order, principle of inversion, etc. 

The WordNet terms were incorporated into the schedule and also in the index to enhance access. For 
each of the terms that needed to be added to the CC Education schedule, the WordNet was consulted 
to determine its grammatical category (GC) and its meaning. Nouns for the most part corresponded 
with the [P] facet terms, verbs with the [E] facet terms and adjectives and nouns with the [M] facet 
terms. The overall match between the facets and the GC was utilized in determining the appropriate 
facets to which the terms belonged. If any of these terms were not listed in the controlled vocabulary, 
the Library of Congress Subject Headings (LCSH), then the WordNet terms were added to the 
classification schedule. Introducing terms in their appropriate array and chain in the schedule 
hierarchy was also an issue, and this was facilitated by the hypernyms, hyponyms and the 
meanings provided in the WordNet. If the terms were available in LCSH, then the WordNet terms 
were utilized to provide additional access by adding them to the schedule index and by providing 
cross-references. Table 3 presents examples of WordNet terms selected for updating the schedule 
[Annexure 3]. 
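
The routing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the mapping table and the return values are hypothetical, and the mapping from grammatical category to facet is just the default tendency reported in the text (nouns may also belong to [M], so an indexer still has to confirm the facet).

```python
# Default facet suggested by the grammatical category (GC).
# Assumption: this table only encodes the tendency reported in the text.
GC_TO_FACET = {"N": "P", "V": "E", "Adj": "M"}


def route_term(wordnet_term, gc, lcsh_heading=None):
    """Suggest a facet and a destination for a WordNet term.

    The destination is "schedule" when LCSH has no heading for the
    concept, and "index" (with a cross-reference) when it has one.
    """
    facet = GC_TO_FACET.get(gc)  # None means the indexer decides by hand
    destination = "index" if lcsh_heading else "schedule"
    return facet, destination


# Two rows in the spirit of Table 3.
print(route_term("Battalion", "N"))
print(route_term("Study", "V", "Content analysis (Communication)"))
```

The second return value is the only part that follows mechanically from the text; the first is a suggestion that the hierarchy placement step may override.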

4.2. WordNet to Link Terms in the Thesauri for Wider Mapping 

Classification schemes are essentially discipline based, and compartmentalize knowledge by 
disciplines. The different senses of the terms in the WordNet can be used to link to other 
topics/sections/contexts within a single thesaurus or to other disciplinary thesauri. In effect, with the 
aid of classification schemes, the WordNet can be used as a switching language to multiple domains/ 
different sections of a single thesaurus. 

The term Development that appears in the CC Education schedule is presented as an illustrative 
example. The first step is to draw the multiple senses of the term from the WordNet. The word 
Development [Noun] in the WordNet is listed under eight senses. They include: 

1. The meaning such as Improving, enlarging and expanding (Industrial Economics) 

2. The meaning such as Evolution (Biology) 

3. Growth, Growing and Maturation as the result of development (Biology/Medical 
Sciences/Ontogenesis). 

4. The meaning Exploitation (Agricultural Economics) 

The senses such as Improvement and Evolution, each representing a different meaning of the term 
Development, can be linked to the corresponding ERIC thesaurus term records. The term records 
contain the hierarchic Broader Terms (BTs) and Narrower Terms (NTs) and the non-hierarchically 
associated Related Terms (RTs). When thus linked by chaining, a wider map of the topic/term 
emerges before the users through the hyperlinks [as in Annexure 4]. 
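
A minimal sketch of this linking step follows, assuming the senses and the thesaurus records are available as simple dictionaries. The data are abbreviated from the example in Annexure 4, and all names (`SENSE_TO_TERM`, `TERM_RECORDS`, `wider_map`) are hypothetical.

```python
# WordNet sense label -> ERIC thesaurus term (two of the senses from the text).
SENSE_TO_TERM = {"improving": "Improvement", "evolution": "Evolution"}

# Abbreviated term records; the full lists are in Annexure 4.
TERM_RECORDS = {
    "Improvement": {
        "BT": [],
        "NT": ["Achievement Gains", "Educational Improvement"],
        "RT": ["Achievement", "Innovation"],
    },
    "Evolution": {
        "BT": ["Development"],
        "NT": ["Heredity"],
        "RT": ["Biology", "Genetics"],
    },
}


def wider_map(senses):
    """Attach the matching thesaurus record to each sense that has one."""
    linked = {}
    for sense in senses:
        term = SENSE_TO_TERM.get(sense)
        if term in TERM_RECORDS:
            linked[sense] = {"term": term, **TERM_RECORDS[term]}
    return linked


print(wider_map(["improving", "evolution", "exploitation"]))
```

Senses without a matching record, such as Exploitation here, are simply left out; in a real system they would be candidates for linking to another disciplinary thesaurus.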





4.3. The WordNet as a Pre-Search Tool for Lay Users 


In this age of worldwide communication, the domain specific specialized vocabularies (thesauri) used 
in the visual resources and bibliographic databases pose limitations of access. The WordNet as an 
auxiliary resource may help alleviate this problem to some extent. In a pre-search preparatory process, 
users can search the WordNet to clarify their ideas and the meanings of terms, and to address other 
vocabulary issues. This will help to develop a pre-search model for user assistance. Towards this end, 
the WordNet pre-search behavior for accessing a visual images database in the field of Art and 
Architecture is being examined (Iyer and Keefe, forthcoming). 

4.4. Use Hypernyms to Broaden and Hyponyms to Narrow Down the Search 

Placing a term in a hierarchic structure serves to disambiguate it. In IL, the hierarchical 
genus-species relation represented as BT and NT is of limited use from the user perspective, partly 
because not all terms, especially abstract concepts, fit into that structure. 
The WordNet hierarchies of hypernyms and hyponyms can therefore complement the IL 
hierarchies and can be used to broaden and narrow down searches. Table 4 presents a comparison 
of the hierarchies in WordNet and the ERIC thesaurus for the term ‘Vocabulary’ [Annexure 5]. 

The word Vocabulary in the WordNet has three senses. The semantic tree in Table-4 in this Annexure 
is derived from primarily the second sense, together with the relevant meanings drawn from the first 
and third senses. 

Unlike the WordNet, which presents a broad general schema, the NTs in the ERIC thesaurus primarily 
list names of specific domain vocabularies. The LCSH takes a linguistic perspective and lists terms 
such as words, language etc. 

The WordNet sets the term in the context of Faculty of language and Verbal communication, which can 
be considered as the upper links. The term Vocabulary branches out into Lexicon and Mental lexicon. 
The Lexicon is seen as an Artifact, Object and Entity, while the Mental lexicon is seen as a process 
of Cognition and therefore as a Psychological feature of mental life. 

The WordNet hierarchy and the IL hierarchies present very different semantic structures and both are 
necessary and relevant for navigation through information. 
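
Broadening and narrowing along such a hierarchy can be sketched as follows. The hierarchy is a toy copy of part of the Vocabulary tree in Annexure 5, not a dump of WordNet itself, and the function names are hypothetical.

```python
# Child -> parent links, taken from the upper part of the tree in Annexure 5.
HYPERNYM_OF = {
    "faculty of language": "human mind",
    "verbal communication": "faculty of language",
    "vocabulary": "verbal communication",
    "lexicon": "vocabulary",
    "mental lexicon": "vocabulary",
}


def broaden(term):
    """Return the chain of broader terms, nearest first."""
    chain = []
    while term in HYPERNYM_OF:
        term = HYPERNYM_OF[term]
        chain.append(term)
    return chain


def narrow(term):
    """Return the immediate narrower terms in alphabetical order."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)


print(broaden("vocabulary"))
print(narrow("vocabulary"))
```

A search interface could offer the first element of `broaden` as the next wider query and the elements of `narrow` as refinements, alongside the BT and NT entries of the thesaurus.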

4.5. Use for Full Text Search 

The Internet, the fulltext and multilingual databases, global users, and multidisciplinarity constitute 
the information world today. In such an environment both NL and IL controlled vocabulary searches 
are necessary. The RTs and the equivalent terms in thesauri are inadequate to meet the needs of NL 
search terms. In the 90s the need to provide vocabulary for full text searches was recognized and this 
resulted in a thesaurus of search terms and synonyms (Knapp 2000). With the influx of fulltext 
databases and the Internet, newer methods to augment such resources need to be devised and newer 
vocabulary tools developed. 

The WordNet is a promising source. The following example lists the terms from the three sources 
(ERIC thesaurus, CC and the WordNet) for the term Reading. As is evident from the example, the 
WordNet has wider coverage of the term. The term Read appears as a [Noun] [Verb] [Adj], and [N] 
has one sense, [V] has 10 senses, and [Adj] has one sense. The term Reading appears as [Noun] and 
[Verb], and [N] has 8 senses, and [V] has 10 senses. Both the terms Read [V] and Reading [V] share 
the same 10 senses. 

The word Reading from the CC and ERIC thesaurus is presented together with the terms from the 
WordNet [Annexure 6]. From Table 5 it is evident that, along with the controlled RTs and the 
equivalent terms, the WordNet can be used as a source of NL terms for fulltext searching. The 
WordNet covers the entire gamut of meanings, from the cognitive process of understanding 
a text to the functional reading aspect, such as meter reading. There is very little overlap between 
the IL terms and the WordNet terms, and therefore drawing terms from all three sources for full 
text searching may be helpful. 
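
Drawing terms from the three sources amounts to an ordered union with duplicates removed. The sketch below assumes plain lists of terms; the sample terms come from Annexure 6, but their assignment to a particular source is an assumption made for illustration, as are the function names.

```python
def merge_search_terms(*sources):
    """Merge term lists in order, dropping case-insensitive duplicates."""
    seen = set()
    merged = []
    for source in sources:
        for term in source:
            key = term.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(term)
    return merged


def to_or_query(terms):
    """Join terms into a simple Boolean OR query with quoted phrases."""
    return " OR ".join(f'"{term}"' for term in terms)


# Illustrative samples only; source assignment is assumed.
eric_terms = ["Skills", "Speed"]
cc_terms = ["Comprehension", "Habit", "Skills"]
wordnet_terms = ["Interpretation", "Recitation"]

merged = merge_search_terms(eric_terms, cc_terms, wordnet_terms)
print(to_or_query(merged))
```

Because the overlap between the IL terms and the WordNet terms is small, the merged list is usually close to the sum of the three lists.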

5. Conclusions 

There is a growing need to improve and refine ways to provide user access to the immense web of 
information and the primary means to that is to chart the universe of knowledge through semantic 
structuring. In summary, this study affirmed the benefits of using the WordNet in conjunction with the 
information handling tools in enhancing search, retrieval and development of these tools. 

References 

Iyer, Hemalata. (1995) Classificatory Structures: Concepts, Relations and Representation. Index 
Verlag, Germany. 

Iyer, Hemalata and Jeanne M. Keefe (forthcoming) The WordNet as an Auxiliary Resource to Search 
Visual Image Database in Architecture. [Paper to be presented at the Seventh International 
Conference of the International Society of Knowledge Organization, July 10-13, 2002, in Granada, 
Spain] 

Iyer, Hemalata and Sharada, B.A. [To be published] The Role of WordNet in Information Retrieval. 

Knapp, Sara D. (2000) The Contemporary Thesaurus of Search Terms and Synonyms: A Guide for 
Natural Language Computer Searching. Phoenix, Ariz.: Oryx Press. 656 p. 

Ranganathan, S.R. (1989) Colon Classification, 7th edition. Bangalore: Sarada Ranganathan 
Endowment for Library Science. 

Sharada B.A. (1997) Transformation of Natural Language into Indexing Language: Kannada-a case 
study. University of Mysore Ph.D. Dissertation, Mysore, Karnataka, India. 232 P. 





ANNEXURE 1 


Sl.No.  Features        Natural Language   Indexing Language
01.     Objective       Semantic           Semantic and Sequence of Concepts
02.     Structure       Grammar            Facet Syntax
03.     Analysis        Grammatical        Postulational
04.     Transformation  Behavioral         Postulational and Hierarchical
05.     Representation  Natural Language   Artificial Language
06.     Modelling       Structural         Hierarchical
07.     Lexicon         Dictionary Based   Taxonomic/Thesaurus Based


Table 1: Natural Language and Indexing Language 












ANNEXURE 2 


Title: Christmann, Edwin P.; Badgett, John L. A Meta-Analytic Comparison between the Assigned 
Academic Achievement Levels of Students Assessed with Either Traditional or Alternative 
Assessment Techniques. 2000 

Facet Analysis 

Education [BS], Students [1P]; Achievement > Academic Achievement > Assigned Academic 
Achievement [1MP1]; Level [1MP2]: Assessment [E] - Traditional Method (Speciator) Comparison > 
Meta-Analytic Comparison (Phase Relation) Assessment [E] - Traditional Method (Speciator) 

WordNet analysis 

[N] = Noun 
[V] = Verb 
[Adj] = Adjective 
[Det] = Determiner 
[Adv] = Adverb 

Meta-Analytic 
Comparison [N] 
Between [Adv] 
Assigned [V] 
Academic [N] 
Achievement [N] 
Levels [N] 
Students [N] 
Assessed [V] 
Traditional [Adj] 
Alternative [Adj] 
Assessment [N] 
Techniques [N] 

Grammatical Analysis 

The phrase structure of this title according to transformational grammar would be 

[[[NP] ][det,A ][[adj.ph][adj, Meta+adj,Analytic]]][[NP[N,Comparison][prep ph 
][prep,between(conjunction marker)][[[NP[det,the][[Adj ph][adj, Assigned+adj,Academic+adj, 
Achievement]]][[NP][N, Levels] [[prep ph][prep,of][[NP][N,student][[NP][N,who (relative pronoun 
optionally deleted)] [[NP[adj, Assessed][[prep ph][prep, with][[NP][N,(Either...or discontinuous 
coordination marker)] [[np1, Traditional][[np2][adj, Alternative+adj, 
Assessment+n, Techniques]]]]] ]]]]]]]]]]] 

Table 2: Facet Analysis, WordNet analysis and Grammatical Analysis 





ANNEXURE 3 



Term          WordNet meaning with GC              Controlled vocabulary (LCSH)       FC    Term added to Schedule (S) / Index (I)
Aboriginal    Native Australian [N]                Australian aboriginals             [P]
Advisors      Consultant [N]                       Consultant                         [P]   Consultant (S)
Analyze       Study/examine [V]                    Content analysis (Communication)   [E]   Analyze (S), Study (I) and Examine (I)
Augmented     Added/made greater [Adv]             Augmented feedback                 [M]
Career        Calling/vocation [N]                 Vocational guidance                [P]   Vocational Guidance (S), Calling (I), Vocation (I)
Intervention  Intercession/interference [N]        Not available (NA)                 [E]   Intercession (S)
Mentor        Wiseman/trusted guide, adviser [N]   Mentor teachers/master teachers    [P]   Mentor (S), Adviser (I), Guide (I)
Multitude     Battalion/large number [N]           NA                                 [P]   Battalion (S)
Scores        Get a certain score [V]              NA                                 [E]   Score (S)
Suitable      Proper/appropriate [Adj]             NA                                 [M]   Appropriate (S)
Suspicion     Intuition/mistrust [N]               NA                                 [M]   Intuition (S)
Vocabulary    List of words/lexicon [N]            Vocabulary                         [M]   Vocabulary (S), Lexicon (I)

GC = grammatical category; FC = fundamental category [P] [M] [E]




Table 3: Updating Schedules and Index: WordNet, LCSH Terms and the Facets 












ANNEXURE 3A 


Terms introduced in the appropriate hierarchies in the CC schedule 


1. Advisors entered in the Schedule of Personality isolate 

(By teachers) 

Consultant 

2. Analyze entered in the Schedule of Energy Isolate 

(Methods based on group learning) 

(By Study method) 

Content Analysis (Communication) 

The WordNet terms Study, and Examine were added to the index with a cross-reference to Content 
Analysis (Communication). 

3. Career entered in the Schedule of Personality isolate 

(By Courses/Programs) 

Vocational guidance 

The WordNet terms Calling and Vocation were added to the index with a cross-reference to the term 
Vocational Guidance. 





ANNEXURE 4 


Wider map of the topic/term 



DEVELOPMENT 
    Sense: Improving  ->  Term: Improvement 
    Sense: Evolution  ->  Term: Evolution 

Term: Improvement 
NT: Achievement Gains; Educational Improvement; Facility Improvement; Finance Reform; 
    Neighborhood Improvement; Program Improvement; Reading Improvement; Urban Improvement; 
    Speech Improvement; Teacher Improvement; Student Improvement; Writing Improvement 
RT: Achievement; Change Strategies; Development; Improvement Programs; Innovation; 
    Pretests Posttests; Satisfaction; Success 

Term: Evolution 
BT: Development 
NT: Heredity 
RT: Anatomy; Astronomy; Biodiversity; Biological Influences; Biology; Botany; Creationism; 
    Cytology; Ecology; Embryology; Ethology; Genetics; Paleontology; Physiology; 
    Sociobiology; Zoology 

By providing links to each of the RT terms, the users can be presented with the term 
Development/Evolution/improvement etc., occurring in different contexts and domains. The user can 
further have access to term records for the RT terms and search across domain boundaries imposed by 
the classification schemes. 









ANNEXURE 5 


Human Mind 
    Faculty of language 
        Verbal Communication (language & speech) 
            Vocabulary 
                Lexicon (list of words) 
                Mental Lexicon (Psychological features) 
                    Perception 
                    Cognition 
                    Knowledge 
                    Reasoning 
                    Learning 


In contrast, the term record for Vocabulary in LCSH lists the following broader and narrower terms. 
Vocabulary 
BT Diction 
BT Lexicology 
NT English language 
NT Language and languages--words 
NT Word recognition 
NT Words, New 

The term record in the ERIC thesaurus reads as follows: 

NT Aviation Vocabulary 
NT Banking Vocabulary 
NT Basic Vocabulary 
NT Chemical Nomenclature 
NT International Trade Vocabulary 
NT Jargon 
NT Keywords 
NT Kinship Terminology 
NT Mathematical Vocabulary 
NT Medical Vocabulary 
NT Sight Vocabulary 
NT Subject Index Terms 
NT Word Lists 


Table 4: Semantic tree for the term Vocabulary 





ANNEXURE 6 


Reading 

The ERIC thesaurus terms are in italics, the CC terms indicate the FC within brackets and the 
remaining terms in bold are from the WordNet. 

Ability 

Achievement[M] 

Aloud 

Comprehension[M] 

Coaches 

Collaborative strategic 
Development[M] 

Disinterest[M] 

Frequency 
Functional 
Tax forms 
Signboards 
Meter 
Receipts 

Any non-academic 
Habit[M] 

Kit 

Lifelong 

Literary forms 

Poetry 

Play 

Story 

Manuals [M] 

Models [M] Models 

Outcome 

Skills [M] Skill 

Speed 

Tool 

Therapy / biblio-counseling 
Tutors 


Table 5: Term representation in ERIC thesaurus, CC and the WordNet for the term Reading 





Cognitive process of understanding 

Interpretation 

Performance 

Intention 

Version interpretation 

Recitation, recital-public instance of reciting 

Say 

Look at, interpret and say out loud 

Scan - obtain data from magnetic tape 
Interpret the significance of... 

Learn, study, be a student 
Register, show, record - indicate a certain reading 
Hear and understand 
Make sense of a language 
Having been read 
Predict 

Written material 


Hemalata Iyer, Associate Professor, School of Information Science and Policy, University at Albany, SUNY, 135 
Western Avenue, Albany, NY 12222, USA, hi651@albany.edu 

B.A.Sharada, Assistant Librarian, SRLC, Central Institute of Indian Languages, Manasagangotri, Mysore, India 
570 006, sharada@ciil.stpmy.soft.net, drsharada@sancharnet.in 





Bulgarian Wordnet as a Source for (Psycho) Linguistic Studies 


Krassimira Petrova 
Toma Nikolov 


Part -1: First Steps to Bulgarian WordNet 
Abstract 

In this paper we present the first attempt to build semi-automatically a core of Bulgarian WordNet for 
nouns. Our contribution to the methods of building WordNet is using elementary sets as an 
intermediate source for synsets and a set of validation functions for sorting the list of candidates for 
proper synsets. Some suggestions for BWN future enlargement and improvement are made. 

Introduction 

The HLT (human language technologies) area is one of the major areas for research and development. 
Presently, it is necessary to create electronic linguistic resources for Bulgarian and other Slavonic 
languages. It has been obvious that each language needs such a lexical database since the WordNet 
(WN) became available in the early 1990s. 

1. About Bulgarian Wordnet (BWN) 

The four major morphosyntactic categories (noun, verb, adjective, and adverb) are treated separately 
in WN. According to psycholinguistic investigations (Miller 1990:14), nouns are hierarchically 
organized as an inheritance system. Nouns are therefore the core (the central, basic, most 
important part) of a WN because of their major role in the organization of the human mental lexicon, 
and they are our starting point in building the BWN. 

1.1. The task of building a core of nouns for the BWN, which is a collaboration between Sofia University 
and the Bulgarian Academy of Sciences, consists of the following principal steps (outlined in Nikolov 
2000): 

a) Studying the structure of a WN as a lexical database system; 

b) Comparison between different theories for converting structures of a certain type into another 
language; 

c) Providing in electronic format the necessary input from the bilingual English-Bulgarian and 
Bulgarian-English dictionaries; 

d) Providing the WN database in the so-called Prolog type. This step includes: 

- Conversion of the database (extracted from dictionaries) into the Prolog format; 

- Modification of the existing main algorithm for converting synsets; 

- Creating proper input data for the main algorithm; 

- Program realization of the main algorithm. 

Our final result is a core of nouns for the BWN. It has been evaluated by a linguist who considered the 
corresponding input data (their volume, correctness, and completeness). Some suggestions for a 
future automatic and semi-automatic development, manual correction, and enlargement of the BWN 
were made. 

1.2. Available Resources 

Two different kinds of input were used for an automatic creation of the BWN: the program code of 
the original WN and the bilingual English-Bulgarian (E-B) and Bulgarian-English (B-E) dictionaries. 
We started with the most important WN lexicon category: the noun. This meant more than 116,000 
entries in the lexical matrix, which represents the mapping between approximately 66,000 word 
meanings (synsets) and 95,000 word forms (in WordNet 1.6). Our resources were restricted from two 
points of view. First, we removed the synsets with proper names, compounds, and a small set of 
artificial collocations. As a result we obtained 54,524 entries in the lexical matrix, 36,594 word 
meanings (synsets) and 36,034 word forms. Second, we depended on the partially available input 
from the E-B and B-E dictionaries in an electronic format. Additional input data were scanned, 
manually sorted, and adapted (in form) for the purpose of our work. The extracted bilingual dictionary 
of nouns consists of approximately 10,400 English nouns in its E-B part, and 5,150 Bulgarian nouns 
in its B-E part. 

The final preliminary step was combining and intersecting the two starting resources (the WN 
database and the bilingual dictionary) in order to extract only the nouns. We organized this output into 
synonym sets. From 10,400 English nouns in the bilingual dictionary and 36,000 word forms in WN 
we created a set of 9,585 nouns to work with. 

1.3. The algorithm 

1.3.1. Most algorithms for an automatic extraction of synsets and hierarchical relations from the WN 
work with single words (Harabagiu 1999; Vossen 1999). 

In the process of semi-manually adapting the input resources we obtained a list of English words 
(EW) and Bulgarian words (BW) which correspond to each other: 

eword bword1; bword2, bword3; bword4 

In this example eword is an EW, while bword1, bword2, bword3, and bword4 are its corresponding 
Bulgarian translation equivalents (according to the E-B dictionary). A semicolon separates different 
meanings of a given eword. A comma separates the Bulgarian synonyms bword2 and bword3, which 
refer to one and the same meaning of the eword. They both must be included in the Bulgarian synset 
corresponding to the second meaning of eword. We named such small proto-synsets ‘elementary sets’, 
or ‘e-sets’. bword1 and bword4 also form e-sets, although these e-sets consist only of a single word. 
Collocations (nominations in more than one word) have been excluded. Thus the most general form of 
the above example is: 

eword e-set 1 {bword1}; e-set 2 {bword2, bword3}; e-set 3 {bword4} 

The e-sets are the most important source for our task. We consider the use of e-sets in the process of 
building a new WN to be our contribution. 
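
The entry format above can be split into e-sets mechanically. The following is a minimal sketch, assuming that the English word is a single token followed by whitespace (collocations being excluded) and that the separators are exactly those described in the text; the function name `parse_entry` is hypothetical and is not taken from the authors' Prolog implementation.

```python
def parse_entry(line):
    """Split one dictionary line into the English word and its e-sets.

    Semicolons separate meanings; commas separate synonyms that share
    one meaning. Each meaning becomes one e-set (a list of words).
    """
    eword, _, rest = line.strip().partition(" ")
    e_sets = []
    for meaning in rest.split(";"):
        words = [word.strip() for word in meaning.split(",") if word.strip()]
        if words:
            e_sets.append(words)
    return eword, e_sets


print(parse_entry("eword bword1; bword2, bword3; bword4"))
```

Keeping each e-set as an ordered list preserves the order of synonyms in the dictionary, which may matter when a linguist later reviews the candidates.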

1.3.2. Creating a Bulgarian synset for a specific concept requires several steps of automatic 
extraction: 

a) Creating a list of the EWs which form a synset in the WN; 

b) Generating a list of all possible Bulgarian e-sets which correspond to these EWs; 

c) If any word of a Bulgarian e-set can be translated into a word of the English synset using the 
Bulgarian-English dictionary, the whole Bulgarian e-set is moved to the list of candidates. It is 
possible to have no candidates if none of the BWs has a translation in the original synset. In such 
cases we include in the list of candidates all Bulgarian e-sets (if such exist). It is even possible to 
have no e-sets at all. In that case there are still one or more BWs which can be translated into one 
or more words of the original English synset using the Bulgarian-English dictionary (the existence 
of such words is a necessary condition for including an English synset in the input data). Each such 
BW is treated as a single e-set and included in the list of candidates. 
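
Steps a) to c), including the two fallback cases, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation (which is described as Prolog-based): the dictionaries are plain Python mappings, the function name is hypothetical, and the evaluation functions that later sort the candidates are not modelled.

```python
def candidate_e_sets(english_synset, eb_dict, be_dict):
    """Return the list of candidate Bulgarian e-sets for one English synset.

    eb_dict maps an English word to its list of e-sets (lists of BWs).
    be_dict maps a Bulgarian word to its list of English translations.
    """
    synset = set(english_synset)

    # Step b: collect every e-set attached to a word of the synset.
    e_sets = []
    for eword in english_synset:
        for e_set in eb_dict.get(eword, []):
            if e_set not in e_sets:
                e_sets.append(e_set)

    # Step c: keep e-sets with a word that translates back into the synset.
    candidates = [
        e_set for e_set in e_sets
        if any(synset & set(be_dict.get(bword, [])) for bword in e_set)
    ]
    if candidates:
        return candidates

    # Fallback 1: no e-set passed the check, so all e-sets become candidates.
    if e_sets:
        return e_sets

    # Fallback 2: no e-sets at all; each BW whose translation lies in the
    # synset is treated as a single-word e-set.
    return [[bword] for bword, translations in be_dict.items()
            if synset & set(translations)]


eb = {"e1": [["b1"], ["b2", "b3"]], "e2": [["b4"]]}
be = {"b2": ["e2"], "b4": ["x"], "b5": ["e1"]}
print(candidate_e_sets(["e1", "e2"], eb, be))
```

The back-translation check in step c is what filters out e-sets that belong to a different meaning of a polysemous English word.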

When completed, this list of candidates is our most important preliminary result. The real Bulgarian 
synsets must be compilations of some e-sets belonging to this list. We have developed a set of 
evaluating functions, which sort the extracted e-sets and outline the most adequate list of candidates 
(described in detail in Nikolov, Petrova 2001). This output is flexible, as the evaluating functions can 
be used in different configurations and orders. 





The algorithm we use is flexible due to the heuristic rules and the technical restrictions of the 
computers. The higher-numbered functions are not only more complex but also require many more 
calculations. That is why we use these functions only in cases where the solution is 
difficult. The different combinations of functions also reflect different aspects from the 
psycholinguistic point of view. 

2. Validation, Evaluation and Some Suggestions for Future Work 

2.1. A linguist has to validate the Bulgarian synsets (i.e. to create the correct one from the sorted list 
of candidates), because it is not possible to work 100% automatically when dealing with linguistic 
resources. In most cases the expert must simply choose one of the top-listed e-sets (usually the first 
one), unite two or more e-sets, or make a subset of them. The average number of e-sets in the list of 
candidates is 1.9. Over 90% of the lists of candidates consist of one or two e-sets. 

2.2. In order to test the developed algorithm we extracted a representative sample of randomly chosen 
e-sets (approximately 1000 entries) from the list of candidates. For each concept and/or meaning we 
compared the English synsets to the corresponding Bulgarian ones. The specific procedure of 
evaluation is described in detail in Nikolov, Petrova (2000). As a reference for this comparison we 
used the gloss (the definition of a concept) and the latest bilingual English-Bulgarian dictionaries. 
If necessary, we corrected the Bulgarian synset. It turned out that for over 98% of the English synsets 
we easily found the appropriate Bulgarian synsets. We consider this result very satisfactory. The 
remaining 2% of the synsets did not have a correct Bulgarian counterpart, mostly because of wrong or 
missing data in the bilingual dictionaries. 

To summarize, the principles of creating the synsets are quite adequate to the linguistic modeling of 
synonym strings in dictionaries. They also correspond to the psycholinguistic principles of semantic 
organization of the concepts in our memory, which resemble the basic principles of the WN and are 
used in the algorithm. 

2.3. Suggestions for future work. Adjusting the Bulgarian WN to the Princeton WN and to the EWN 
depends on both theoretical work and the available (in an electronic format) lexical database. 
Therefore we have to elaborate the theoretical studies and create or adapt (in form) additional lexical 
data for the purposes of the BWN. 

It is necessary to build up a Bulgarian lexical database for multilingual information retrieval via the 
Inter Lingual Index (see Vossen 1999:10). We can use the available monolingual dictionaries in order 
to improve and enrich the BWN (for references to the dictionaries and examples, see Nikolov, Petrova 
2001). 

2.3.1. Taxonomy operators from the explanatory dictionary definitions could be applied in building 
the lexical ontology. These operators (e.g. “X is a part of Y”, “X is a kind of Y”) are useful because 
they explicitly show the place of a given concept in the hierarchy with respect to the other concepts in 
the ontology. 

2.3.2. Synsets for nouns in the BWN and synonym strings from the synonym dictionaries should be 
similar or coincide. Increasing the volume of a given synset is possible using data from the Synonym 
dictionary for Bulgarian (or any other language). 

2.3.3. One has to consider and describe (on the morphological and derivational level) the regular 
typological features of Bulgarian and English cross-language parts of speech (the ways of translating 
the gerund, the homonymy of adjectives and present and past participles), and to follow the typical 
derivational models for certain parts of speech, for example the rich system of verb prefixes: 

coin(109541569, [yield, takings, take, proceeds, issue, return], [връщане, възвръщане, завръщане]). 





2.3.4. Regarding word formation, we point out that various suffixes for nouns which are derived from 
verbs and denote actions, events, and states are available in Bulgarian. However, they are not 
represented in the electronic database we used. We miss a lot of such nouns as a result of our 
incomplete dictionary input. This affects the completeness of the Bulgarian synsets corresponding to 
some English gerunds and nouns. This problem is solvable by using a new “Word Formation 
Dictionary of Contemporary Bulgarian”. 

Ex. coin(100707799, [interchange, exchange], [размяна, разменяне, обмяна, обменяне]). 

2.3.5. The synsets in the Bulgarian WordNet could be modified with dictionary data from the 
Dictionary of Association Norms for Contemporary Bulgarian and the preliminary materials for the 
Bulgarian volume of the Slavonic Association Thesaurus, available in an electronic format. The 
organization of the semantic memory, presented in the associative fields to the stimuli, resembles the 
psycholinguistic principles of organizing the words in the ontology. 

2.3.6. Including some new loan words in the Bulgarian synsets depends on whether or not they are 
included in Bulgarian explanatory dictionaries and/or in the dictionaries of loan words. Such words 
have undergone different stages of adaptation and become a part of the lexical system of 
contemporary Bulgarian. 

2.3.7. To conclude, we try to analyze whether the possible differences between the desired and the obtained 
Bulgarian synsets are due to the input data or to the mechanism of obtaining them when building 
the BWN. A high level of independence between these two factors is desired. We have also tried to 
suggest some methods of automatic or semi-automatic development of the core for the BWN with the 
perspective of maintaining the multilingual lexical database. 


CONCLUSIONS 

The initially built core of nouns for the BWN is a starting point for creating the new lexical database for 
Bulgarian. The semi-automatic algorithm for creating the BWN is validated by the evaluation functions 
but still needs some manual adaptation. 

References 

Fellbaum C. (1998) WordNet: An Electronic Lexical Database. Ed. Christiane Fellbaum, The MIT
Press, Cambridge, London, England, 1998.

Harabagiu S. (1999) Lexical Acquisition for a Romanian WordNet. Presented at Eurolan99,
http://www.seas.smu.edu/~sanda

Miller G.A. (1990) Nouns in WordNet: a Lexical Inheritance System. In: International Journal of
Lexicography 3 (4), Revised August 1993 - accessible at
ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps

Nikolov T. (2000) Изграждане на ядро за съществителни имена за български WordNet (=
Building a Core for Nouns for Bulgarian WordNet), diploma thesis, Sofia University, Department of
Mathematics and Informatics.

Nikolov T. and Petrova K. (2000) Building and Evaluating a Core of Bulgarian WordNet for Nouns.
In “OntoLex'2000. Workshop on Ontologies and Lexical Knowledge Bases. Supported by Bulgarian
Academy of Sciences”. Sept. 8-10, 2000: Sozopol, Bulgaria, (under print).

Nikolov T. and Petrova K. (2001) Towards Building Bulgarian WordNet - Euroconference Recent
Advances in Natural Language Processing. Proceedings. Ed. G.Angelova, K.Boncheva, R.Mitkov,
N.Nicolov, N.Nikolov. Tzigov Tchark, Bulgaria, 5-7 September 2001, pp.199-203.

Vossen P. (ed.) (1999) EuroWordNet. General document. Version 3, Final, July 20, 1999. - available
at http://www.hum.uva.nl/~ewn





Part - 2: Bulgarian Wordnet as a Source for (Psycho) Linguistic Studies 
Abstract 

The lexical items for “Time” and “Periods of time” from WordNet and from the newly built core for
nouns for the Bulgarian WordNet are used as a source for cross-language comparison of concepts,
lexemes, and lexical-semantic relations in Bulgarian and English.

1. The task of building a core of nouns for the Bulgarian WordNet (BWN) is a collaboration between
Sofia University and the Bulgarian Academy of Sciences (described in Nikolov, Petrova 2000, 2001a).
The final result is a core of 9,585 nouns for the BWN. It has been evaluated by a linguist who has
considered the corresponding input data (their volume, correctness, and completeness). Some
suggestions for future automatic and semi-automatic development, manual correction, and
enlargement of the BWN were made (Nikolov, Petrova 2001b).

2. This paper is in the field of comparative lexical semantics and psycholinguistics based on the use 
of WN as a lexical database source. In the cross-language description of the lexicon of some 
languages we choose a methodology which goes through several stages: from analysis of single 
lexical items to the comparison of systematic objects (lexico-semantic (LSG) and thematic groups 
(TG)) (see Petrova 1996). It was demonstrated that the results are much more precise, detailed, correct 
and complete when we compare semantically related lexical units. 

3. Our goal in this paper is to illustrate the use of the Princeton WN and the initially built part of
the BWN for the purposes of psycholinguistic research on lexical microsystems.

3.1. The general purpose is to try to verify the hypothesis of a gradual decrease of the coefficient
of semantic closeness (CSC) within the LSG (on the material of the LSG “Time” in English and
Bulgarian). The CSC is a part of the formal and semantic analysis of pairs of equivalents across the
considered languages, i.e. L1 and L2. It is calculated by the following formula from semantic
definitions constructed on the basis of dictionary and experimental data.


$$ y = \frac{2c}{a + b} $$

where

y - the coefficient of semantic closeness (CSC)

2c - the doubled number of meanings shared by the words in L1 and L2

a - the total number of meanings of the word in L1

b - the total number of meanings of the word in L2
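
The formula can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name SemanticCloseness, the method csc and the meaning labels in main are hypothetical and do not come from the study, and meanings are simplified to plain strings so that the shared meanings (c) can be counted as a set intersection.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of the coefficient of semantic closeness, y = 2c / (a + b). */
final class SemanticCloseness {

    /**
     * @param meaningsL1 meanings of the word in L1 (hypothetical string labels)
     * @param meaningsL2 meanings of its equivalent in L2
     * @return a value between 0.0 (nothing shared) and 1.0 (all meanings shared)
     */
    static double csc(Set<String> meaningsL1, Set<String> meaningsL2) {
        int a = meaningsL1.size();   // all meanings of the word in L1
        int b = meaningsL2.size();   // all meanings of the word in L2
        if (a + b == 0) {
            throw new IllegalArgumentException("at least one word must have a meaning");
        }
        // c: the meanings shared by both words
        Set<String> shared = new HashSet<>(meaningsL1);
        shared.retainAll(meaningsL2);
        int c = shared.size();
        return 2.0 * c / (a + b);
    }

    public static void main(String[] args) {
        // Invented labels, used only to exercise the formula.
        Set<String> l1 = Set.of("period", "moment", "occasion");
        Set<String> l2 = Set.of("period", "moment", "weather");
        System.out.println(csc(l1, l2));
    }
}
```

With two shared meanings out of three on each side, the call in main evaluates 2*2 / (3+3). Identical meaning sets give 1, and disjoint sets give 0.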


We arrived at the following hypothesis: in closely related languages, such as Bulgarian and Russian,
the CSC falls gradually within an LSG with field structure, from its name (identifier) through the
nuclear units to the periphery (see Petrova 2000). This hypothesis was confirmed for the LSG “Time”
and “Space”, and the results were presented visually in histograms.

3.2. Now we are interested in whether this hypothesis is also relevant to languages from different
families, such as English and Bulgarian. Can we use the ontology, the items and their relations in WN
for such (psycho)linguistic studies?

We try to reformulate the hypothesis, our starting point, for the WN. We observe that the more
abstract the lexical items are (such as names of the groups, basic nodes in the ontology), the more
often they coincide in different languages. Therefore the higher elements are language independent.
The lower in the hierarchy the items are, the more differences appear, so these elements are more
language dependent. There are some lexical items with a very high CSC (coincidence or high
overlapping of their semantics) on the boundaries with other groups. We cannot apply the
quantitative method for CSC to Bulgarian and English, but we try to make qualitative, intuitive
observations.





3.3. The object of comparison is the “branch” of the ontology for ‘Time’ from the original WN and,
respectively, from the core for nouns in the BWN. It is interesting that the ‘Time’ group in the
English-Bulgarian thematic dictionary (which follows the same organization as Roget’s thesaurus) and
the WN lexicalized concepts for time are organized differently. ‘Time’ forms a compact group in the
chapter “Time and Space” in the thematic dictionary. In WN 1.6, 609 hierarchically organized entries
for time are represented, but they are distributed under two different mother concepts: ‘Time’ as one
of the 25 basic categories, and ‘Period of time’ under ‘Quantum, quantity, amount, measure’. This is
due to the design principles of WN, which combines features of both types of lexical reference
resources, a traditional dictionary and a thesaurus (see Fellbaum 1998:7). The “key-word” in the
dictionary is “time”, whereas in WN “time” could be viewed as an attribute of the “key-word” period,
that is, of the idea of measurement.

There are 77 nouns for time and periods of time in the Bulgarian counterpart, first because of the
restrictions adopted in the process of manual adapting of input data from the original WN, and
second because of the small volume of the input bilingual English-Bulgarian dictionaries in
electronic format.

It is significant for the analysis that the lexical items are represented not only by their meanings
but also by their various lexical relations, referring to the basic relations in WN.

3.4. We try to describe the analogous fragments of the English and Bulgarian “language picture of the
world”, which represents in signs the corresponding segment of the “conceptual picture of the world”
(according to the terminology used by B.Serebrennikov, 1987, and the following series of books). We
find some common features and differences in that particular fragment of the lexical systems (as
lexical items and relations) of English and Bulgarian as a reflection of the conceptual-semantic
relations which link concepts. “Looking into” the mental lexicon and mental organization through
lexical or linguistic organization, and through the glossaries as their representation, is possible
even though the relation is indirect and conditional, as they “[do] not coincide neatly” (see
Fellbaum 1998:9).

3.5. According to the relation between concepts, their lexicalization and additional metaphorical
components of meaning, we can distinguish different degrees of similarity and diversity between
lexical items. As a whole, the structures of both “branches” for ‘Time’, and especially for ‘Periods
of time’, in English and Bulgarian are similar, because the meanings of the basic lexemes are very
close, and all coordinate terms are similar.

3.5.1. Some microstructures (e.g. the days of the week, months, days of saints, some Jewish holidays,
etc.) are given as lists and contain brief encyclopedic information. They can be useful in foreign
language teaching and learning with respect to the history, culture and traditions of the country.
Most of this information is missing in the initially built part of BWN, but it is a matter of future
enlargement. For example, some nationally specific information could be extracted from WN, and could
be very useful in (self-)education, in studying the language and the history of a certain country:

=> Pennsylvanian, Pennsylvanian period, Upper Carboniferous, Upper Carboniferous period --
(from 280 million to 310 million years ago; warm climate; swampy land)

=> Mississippian, Missippian period, Lower Carboniferous, Lower Carboniferous period -- (from
310 million to 345 million years ago; increase of land areas; primitive ammonites; winged
insects)

=> Atlantic Time -- (time of the 60th meridian, used in Puerto Rico, the Virgin Islands, Bermuda, and
the eastern edge of Canada)

=> Eastern Time, Eastern Standard Time, EST -- (time of the 75th meridian, fifth time zone west of
Greenwich, used in the eastern U.S.)

=> Central Time, Central Standard Time, CST -- (time of the 90th meridian, used in the central U.S.)

3.5.2. In the following case we can see the coincidence of the concept and its lexicalization in
English and Bulgarian (in contrast with the lexical gap in Russian, where such a concept and lexeme
are lacking):





=> afternoon -- (the part of the day between noon and evening; "he spent a quiet afternoon in the
park") - Bulgarian следобед

3.5.3. English has a more narrowly defined, fine-grained concept which is missing in Bulgarian, and,
respectively, the lexeme is missing:

=> midafternoon -- (the middle part of the afternoon)

3.5.4. The derivational model with the prefix “mid-” is very productive in English but not in
Bulgarian (except for the words corresponding to midday and midnight):

=> midterm -- (the middle of the gestation period)

=> mid-January -- (the middle part of January)

=> mid-February -- (the middle part of February), etc.

=> midwinter -- (the middle of winter)

3.5.5. In some cases English and Bulgarian traditions are similar, for example celebrating the name
day (the tradition is especially popular in Bulgaria).

=> saint's day — (a day commemorating a saint) 

=> name day — (the feast day of a saint whose name one bears) 

3.5.6. Even if the concept and the lexeme are the same, the “inner form”, the metaphor underlying the
collocation, can be different. The synaesthesia, the “hidden image”, in the English expression is
“vertical”, while the Bulgarian one is “horizontal”: крайно време, literally “the time on the edge”
(cf. the “lexical functions” of Mel’chuk and Zholkovski):

=> time - (a suitable moment; "it is time to go") 

=> high time - (the latest possible moment; "it is high time you went to work") 

Adding and describing the figurative language and cases of regular polysemy is a further perspective 
for development of the WN (see Peters 2001). 

3.5.7. The metaphorical meanings can be similar in both languages: 

=> youth, early days - (an early period of development; "during the youth of the project") 

=> dawn -- (an opening time period; "it was the dawn of the Roman Empire")

But, as it seems to us, Bulgarian follows a “more symmetrical” way of nomination than English: the
same concept is represented by the antonym of dawn, залез 'sunset':

=> evening — (a later concluding time period; "it was the evening of the Roman Empire") 

3.5.8. Similar experience and wisdom of people can be “sublimated” in different language units: a
“period of time” expression in English, and a saying or a proverb in Bulgarian (“An hour's sleep
before midnight is equal to two hours’ sleep after midnight”).

=> beauty sleep -- (sleep before midnight) 

3.5.9. In some cases the word in Bulgarian is missing because the concept is missing (the case of 
lexical gap, and also a “conceptual gap”). It can be compensated on another level of nomination using 
an explanation. 

=> fortnight, two weeks - (a period of fourteen consecutive days; "most major tennis tournaments 
last a fortnight") 

3.5.10. An analogous example is a missing concept, and respectively a missing word, in English for a
“24-hour period of time”, versus the concept and a single word in Bulgarian: денонощие
(ден+о+нощ+и+е = day+interfix+night+suffix+flexion).

3.5.11. Two different models in English, a concept expressed by a collocation and a concept
expressed by a single word, have only one correspondence in Bulgarian: these concepts are not
expressed by a single word but only by a collocation:





=> growing season -- (the season during which a crop grows best) 

=> baseball season — (the season when baseball is played) 

=> seedtime — (the time during which seeds should be planted) 

=> sheepshearing — (the time or season when sheep are sheared) 

=> preseason — (a period prior to the beginning of the regular season which is devoted to training 
and preparation) 

3.5.12. There are some cases when the concept is represented by a single word in English but by a 
collocation in Bulgarian, so it is missing in BWN: 

=> weekday — (any day except Sunday (and sometimes except Saturday)) 

3.5.13. On the other hand, some English loan words have been “adopted” in Bulgarian and can replace
the collocation in Bulgarian, so BWN can be enlarged:

=> weekend -- (usually Friday night through Sunday) - Bulgarian уикенд, почивни дни

3.5.14. A difference that is usually pointed out at the earlier stages of studying English is that
the first day of the week is Sunday for the English but Monday for Bulgarians (although both
countries are Christian).

=> Sunday, Lord's Day, Sun -- (first day of the week; observed as a day of rest and worship by most 
Christians) 

3.5.15. The expression can be “strange” for non-native speakers because of totally antonymous
associations underlying the idiom. Contrary to the English expression, the Bulgarian phrase is
кучешки студ ‘beastly, bitter cold weather’.

=> dog days, canicule, canicular days -- (the hot period between early July and early September; a 
period of inactivity) 

3.6. A close look at the “branch” for “Time” has made it possible to find some wrong synsets in place
of lexical gaps in BWN. They appear because the concept is missing in Bulgarian. The Bulgarian synset
is a mirror translation of the English synset, but for another meaning and, respectively, another
gloss:

an era of existence or influence; "in the day of the dinosaurs"; "in the days of the Roman
Empire"; "in the days of sailing ships" => day - ден, денонощие

(British) a week at British universities during which side-shows and processions of floats are
organized to raise money for charities => rag - парцал, дрипа

A comparison between the English synset, the Bulgarian one and the gloss shows that manual
correction of the obtained data for BWN will be needed.

Conclusions 

WN as a lexical database is very useful and convenient for the purposes of comparative studies.
It provides similarly structured lexical objects and concepts, which makes it possible to point out
common features and differences in cross-language analysis of the lexical systems and the mental
lexicon. The possibility of comparing fragments from different languages connected through the
Inter-Lingual Index in EuroWordNet (see Vossen 1999) complements other (psycho)linguistic methods.
The hypothesis of a gradual decrease of the coefficient of semantic closeness in the LSG, or semantic
fields, in closely related languages (such as Bulgarian and Russian) could also be applied to other
languages (such as Bulgarian and English). We can observe coincidence or very high similarity between
items high in the hierarchy, and more differences on lower levels of the ontology. The initially
built core for nouns for BWN, which is still to be completed, enlarged and elaborated, along with the
other morphosyntactic categories, is a restricted but useful base for comparative analysis.





References 


Fellbaum C. (1998) WordNet: An Electronic Lexical Database. Ed. Christiane Fellbaum, The MIT
Press, Cambridge, London, England, 1998.

Nikolov T. (2000) Изграждане на ядро за съществителни имена за български WordNet (=
Building a Core for Nouns for Bulgarian WordNet), diploma thesis, Sofia University, Department of
Mathematics and Informatics.

Nikolov T. and Petrova K. (2000) Building and Evaluating a Core of Bulgarian WordNet for Nouns.
In “OntoLex'2000. Workshop on Ontologies and Lexical Knowledge Bases. Supported by Bulgarian
Academy of Sciences”. Sept. 8-10, 2000: Sozopol, Bulgaria, (under print).

Nikolov T. and Petrova K. (2001a) A Core of Bulgarian WordNet for Nouns. - Human and Computer.
Verbal communication and interaction via computer. Ed. T.Slama-Cazacu. 9-th International
conference of GRLA-RWCAL, Bacau-Tescani, Romania, 26-29 April 2001, Editura Europolis,
Constanta, pp.279-296.

Nikolov T. and Petrova K. (2001b) Towards Building Bulgarian WordNet - Euroconference Recent
Advances in Natural Language Processing. Proceedings. Ed. G.Angelova, K.Boncheva, R.Mitkov,
N.Nicolov, N.Nikolov. Tzigov Tchark, Bulgaria, 5-7 September 2001, pp.199-203.

Peters W. (2001) An Exploration of Figurative Language Use in WordNet. In “Corpus Linguistics
2001, Lancaster University (UK)”, 30 March - 2 April 2001; University Centre for Corpus research
on Language Technical Papers Volume 13 - Special issue. Proceedings of the conference.
http://www.comp.lancs.ac.uk/ukrel/cl2001.html

Petrova K. (1996) Comparative semasiologic analysis of lexico-semantic groups in closely related
languages (in Russian and Bulgarian). - Sofia, unpublished PhD thesis.

Petrova K. (2000) Coefficient of Semantic Closeness in Lexico-Semantic and Thematic Groups of the
Same Name in Closely Related Languages. In “Papers from the third Conference on Formal
Approaches to South Slavic and Balkan Languages”. Plovdiv, September 24 - 26, 1999. Ed. by Mila
Dimitrova-Vulchanova, Iliana Krapova, Lars Hellan. University of Trondheim. Working papers in
Linguistics. 2000/volume 34, pp.286-295.

Serebrennikov B. (1987) Роль человеческого фактора в языке. Язык и картина мира. (= Role of
the human factor in language. Language and the world picture). Moscow, 1987.

Vossen P. (ed.) (1999) EuroWordNet. General document. Version 3, Final, July 20, 1999. - available 
at http://www.hum.uva.nl/~ewn 

Krassimira Petrova, Department of Slavonic Studies, Sofia University “St. Kl. Ohridski”, 15 Tzar Osvoboditel
Blvd., Sofia, Bulgaria 1540, krasi@slav.uni-sofia.bg

Toma Nikolov, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, 25A Acad. G. Bonchev Str.,
Sofia, Bulgaria 1113, toma@lml.bas.bg





Lexicons in an Object-Oriented Grammatical Model 
For Universal Grammar-Based Machine Translation (UGBMT) 

Yukiko Sasaki Alam 
Shahid Alam 

1. Introduction 

This paper describes ongoing work on designing an object-oriented model for machine
translation and building the lexicons. The technology of object-oriented programming, in 
particular the rich library of classes and programming principles Java offers, provides a 
convenient tool to conceptualize the process of machine translation. To our knowledge, there 
have been very few publications so far on object-oriented natural language processing ([1], 
[8], [10], [11], [12]). The present model takes advantage of object-oriented programming 
technology. 

This model also draws on advancements in linguistic research. In particular, it assumes that 
language has two levels of structures, Surface Structure and Deep Structure, as proposed by 
proponents of Transformational Grammar ([2]) and Case Grammar ([5]). Furthermore, by 
incorporating the premise that in spite of idiosyncrasies exhibited in individual languages, 
there are uniformities of universal scope ([6]), the model houses information common to all 
languages in Universal Grammar components. In addition, the model postulates that there is a 
level of structure deeper than Deep Structure that can be called Universal Structure to 
represent language-independent deep understanding of sentences. 

The Lexicon in Universal Grammar and the lexicons in individual grammars differ in many 
respects. While the indexes of the lexicons of individual languages are lists of surface forms 
of words, the index of the Lexicon in Universal Grammar is a list of meanings or senses. The 
Universal Lexicon contains information commonly found in individual languages, but the 
lexicons in individual languages hold information on meanings, morphology, and 
idiosyncratic properties of words. Meanings are points of reference to get deeper 
understanding of linguistic constructs. Order of constructs at each level of structure is based 
on such functional units as heads, modifiers, complements and specifiers. 

This paper first shows the architecture of this model, then the functions of each level of 
structure, and finally the roles of the lexicons in the process of translation. 

2. Universal Grammar and Language-Specific Grammars 

This object-oriented grammatical model includes Universal Grammar and Lexicon as well as
language-specific grammars and lexicons. It assumes that an English-Japanese and
Japanese-English translating agent has knowledge of Universal Grammar, English and Japanese.
The three knowledge areas are modeled in three Java packages, universalgrammar, english
and japanese. The package universalgrammar stores in its Java classes prototypical
information such as prototypical syntactic categories, verb classes and semantic features
innately associated with entities and events. This architecture not only contributes to economy
of the model, but also enables an easy conceptualization of the translation process. However,
this approach also exposes the formidable task of distinguishing the universal from the
language-specific. This is no doubt not a simple task, but once it is accomplished, grammar
writers of individual languages will no longer need to know the grammars of target languages,
only the grammars of their own languages.


3. Surface Structure, Deep Structure and Universal Structure 


This model is designed on the premise that deep understanding of language is represented at 
Universal Structure via Surface Structure and Deep Structure. The translation process 
proceeds from the Surface Structure of a sentence in a source language to the Surface 
Structure of the corresponding sentence in a target language via Deep Structures and 
Universal Structure. 



Fig. 1: Three levels of sentence representation 

As illustrated below, Surface Structure of a sentence is composed of word forms labeled with 
the syntactic categories. 



Fig. 2: Surface Structure of the English sentence They did not win the game. 

On the other hand, Deep Structure of a sentence is composed of the meanings and functions 
of the word forms, as illustrated in Fig. 3. Note that the capitalized words in the root nodes of 
the Deep Structure stand for meanings, as opposed to lowercased words representing word 
forms in the root nodes of the Surface Structure. 



Fig. 3: Deep Structure of the English sentence They did not win the game. 
































By virtue of the Deep Structure of a sentence, the Universal Structure obtains more general 
semantic information such as thematic roles of the arguments of the verb (e.g., AGENT and 
GOAL) and the aspect type of the verb (e.g., ACHIEVEMENT). An example of Universal 
Structure is given below. 



Fig. 4: Universal Structure for the sentence They did not win the game. 

Notice that all three structures are built up of functional categories such as heads,
complements, specifiers and modifiers, on the assumption that the components of each
structure are organized according to the functional roles they play in a sentence or a phrase.

4. Roles of the Lexicons

This section focuses on the roles played by the lexicons of individual languages and the 
Lexicon of Universal Grammar. By referring to the lexicon of the source language as well as 
the grammar, the Surface Structure of a sentence is obtained. For instance, the package 
english includes the class EnglishSentence that extends the class Sentence in 
universalgrammar , thus inheriting the attributes and methods from its super class. Both 
classes are illustrated below: 


Sentence:         lexicon, sentence, modifier, specifier, head, subject
EnglishSentence:  englishlexicon, position_at_eng_s

Fig. 5: The class Sentence in universalgrammar and the class EnglishSentence in english

The package english also includes the class EnglishNp, which extends the class Np in
universalgrammar, thus inheriting the attributes and methods from its super class. Both
classes are illustrated below:


Np:         modifier, specifier, head
EnglishNp:  englishlexicon

Fig. 6: The class Np in universalgrammar and the class EnglishNp in english

Thus, the value of englishsentence.specifier.head is the String they, and the value of
englishsentence.head.complement.complement.head is the String win. By referring to the
englishlexicon, such information as the meaning, function and morphology of each word form
is obtained. An entry in the englishlexicon may be defined in the following way:

insertLexicalEntry ("win", new LexicalEntry (new LexicalSubEntry (VMORPH1,
    new UsageList (
        new EnglishTransitive ("WIN_GOAL", (("OBJSEMCAT", "COMPETITION"))),
        new EnglishTransitive ("WIN_POSS", null)
    ...)));

Fig. 7: Lexical entry for the English verb win 




























The above lexical entry includes two meanings of win.¹ The first meaning appears in a
sentence like They won the game, and the second in a sentence like He won an Oscar for his
screenplay. The last argument of EnglishTransitive is a list of idiosyncratic properties of the
verb win in the meaning of WIN_GOAL, requiring in this case that the semantic category of
the object be COMPETITION. This requirement overrides the general assignment in the class
EnglishTransitive that objsemcat = verbclass.objsemcat, which means that the semantic
category of the object of a verb is normally obtained not from the lexical entry for each verb,
but from the verb class of the verb contained in universalgrammar. This prevents redundancy
of information while capturing generalizations of some properties of verbs.

To reach Deep Structure, the correct meaning must be chosen, in this example, between
WIN_GOAL and WIN_POSS. For that purpose, the semantic categories of the arguments
play an important role. The immediate reference for distinction is made to a list of
idiosyncrasies, which is the last argument of each transitive usage above: the usage in the
meaning of WIN_GOAL requires that the semantic category of the object be
COMPETITION. The object argument of the sentence in question is battle, whose meaning in
the englishlexicon is listed as BATTLE. To look up the semantic category of the meaning
BATTLE, a search runs down the universallexicon and locates the lexical entry for
BATTLE, which contains the semantic category COMPETITION. Thus, WIN_GOAL is
selected as the meaning of win in They did not win the battle.² It should also be noted that the
semantic category of a word is not listed in a lexical entry in the lexicon of an individual
language, but in the lexical entry for the meaning in universallexicon.³
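
The selection step described above can be sketched in Java as follows. This is a simplified illustration, not the model's actual code: the class SenseSelection, its maps and the method selectMeaning are hypothetical, semantic categories are reduced to strings, and the entry OSCAR with category AWARD is invented for the example. The fallback to a usage without an idiosyncratic requirement is also an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch: choosing a verb meaning from the semantic category of its object. */
final class SenseSelection {

    // Stand-in for the Universal Lexicon: meaning -> semantic category.
    // BATTLE -> COMPETITION follows the text; OSCAR -> AWARD is invented.
    static final Map<String, String> UNIVERSAL_SEMCAT = Map.of(
            "BATTLE", "COMPETITION",
            "OSCAR", "AWARD");

    // Stand-in for the usage list of "win": meaning -> required object category.
    // null means no idiosyncratic requirement, as for WIN_POSS in Fig. 7.
    static final Map<String, String> WIN_USAGES = new LinkedHashMap<>();
    static {
        WIN_USAGES.put("WIN_GOAL", "COMPETITION");
        WIN_USAGES.put("WIN_POSS", null);
    }

    /** Returns the first usage whose requirement matches, else the first unrestricted usage. */
    static String selectMeaning(Map<String, String> usages, String objectMeaning) {
        // Look up the object's semantic category in the universal lexicon.
        String objCategory = UNIVERSAL_SEMCAT.get(objectMeaning);
        String fallback = null;
        for (Map.Entry<String, String> usage : usages.entrySet()) {
            String required = usage.getValue();
            if (required == null) {
                if (fallback == null) {
                    fallback = usage.getKey();   // remember the unrestricted usage
                }
            } else if (required.equals(objCategory)) {
                return usage.getKey();           // idiosyncratic requirement satisfied
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(selectMeaning(WIN_USAGES, "BATTLE"));
        System.out.println(selectMeaning(WIN_USAGES, "OSCAR"));
    }
}
```

In this sketch the object meaning BATTLE satisfies the COMPETITION requirement and selects WIN_GOAL, while an object of any other category falls through to WIN_POSS.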

Meanings of word forms such as WIN_GOAL and THEY are replaced with surface forms at 
Surface Structure in the target language. The Universal Lexicon contains information on 
surface forms for meanings, as illustrated in an entry for WIN_GOAL: 
insertLexicalEntry ("WIN_GOAL",
    new LexicalEntry (new UnivLangDefinition (GoalOrientedVerbClass),
        new LanguageDefinition ("english", "win"),
        new LanguageDefinition ("japanese", "kat")
    ));

Fig. 8: Lexical Entry for WIN_GOAL in universallexicon

5. WordNet, EuroWordNet and Levin (1993)

As demonstrated above, reference to semantic categories of the arguments is imperative to
disambiguate polysemous verbs.⁴ [9] lists 25 unique beginners or semantic categories for
noun source files, noting the need for cross-referencing in some cases. This list of semantic
categories is not sufficient for this model, but, together with the subsets for nouns listed
in [3], it can be a basis for the list of semantic categories required by this model. Semantic
categories in [9] are selected according to the study of possible adjective-noun combinations.
In this model, the list of semantic categories for nouns must be enriched to meet the need to
disambiguate polysemous verbs as well as nouns. The task is currently under way.

1 [7] does not list win in the meaning of WIN_GOAL. However, the fact that win in the above two meanings
appears in two different case frames in Japanese supports the current treatment.

2 If the object of this sentence were game, the situation would be more complex. As game has such meanings as
GAME_FIGHT and GAME_ANIMAL, only the context can disambiguate the meaning of the sentence.

3 The idea behind this arrangement is that irrespective of different word forms in individual languages, the 
meaning should have the same semantic category, so the information should be contained in Universal Grammar. 

4 According to [4], English verbs are approximately twice as polysemous as nouns. 





Senses distinguished in WordNet and EuroWordNet can be a source of meanings that are
indexes of Universal Lexicon in this model. [13] lists eight different senses of the noun board
such as governing_board, board of directors, advisory_board, school_board, circuit_board
and diving_board. Each sense or meaning in this model must be entered in the lexical entry
for the noun board in the lexicon of english, while the Universal Lexicon should have as many
entries as there are senses of board.

The current model also benefits from research of verb classes in [7], although the lists of verb 
classes differ. As demonstrated above, verb classes also play a pivotal role in this model. A 
verb class is a Java class, so that verbs of this verb class can obtain common semantic
categories of the arguments from the verb class. For instance, the verb win in the meaning of 
WIN_GOAL belongs to the verb class GoalOrientedVerbClass, which has the subject whose 
thematic role is AGENT and the object whose thematic role is GOAL. The semantic category 
of the subject is ANIMATE or ORGANIZATION that can be AGENT, and the semantic 
category of the object is ENTITY. The class GoalOrientedVerbClass is illustrated below: 
final public class GoalOrientedVerbClass extends Transitive {   // see footnote 5
    private String subjthemrole, objthemrole, eventaspect;
    private Class subjsemcat, objsemcat;

    public GoalOrientedVerbClass () {
        subjthemrole = "AGENT";
        objthemrole = "GOAL";
        subjsemcat = ANIMATE || ORGANIZATION;
        objsemcat = ENTITY;
        eventaspect = "ACHIEVEMENT";
    }
    ...

Fig. 9: class GoalOrientedVerbClass in the package universalgrammar

All the verbs in this class have the AGENT subject and the GOAL object, together with the
above semantic categories for the subject and object, unless the lexical entries for the verbs
contain information on idiosyncratic properties.
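
The relation between class-level defaults and idiosyncratic lexical properties can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the model: the names VerbClassSketch, GoalOrientedSketch and UsageSketch are hypothetical, semantic categories are simplified to strings, and only the object category is made overridable.

```java
/** Illustrative sketch: a verb class holding defaults shared by all its verbs. */
class VerbClassSketch {
    private final String subjThemRole;
    private final String objThemRole;
    private final String objSemCat;

    VerbClassSketch(String subjThemRole, String objThemRole, String objSemCat) {
        this.subjThemRole = subjThemRole;
        this.objThemRole = objThemRole;
        this.objSemCat = objSemCat;
    }

    String subjThemRole() { return subjThemRole; }
    String objThemRole()  { return objThemRole; }
    String objSemCat()    { return objSemCat; }
}

/** Defaults described for GoalOrientedVerbClass: AGENT subject, GOAL object, ENTITY object category. */
final class GoalOrientedSketch extends VerbClassSketch {
    GoalOrientedSketch() {
        super("AGENT", "GOAL", "ENTITY");
    }
}

/** One usage of a verb: a meaning, its verb class, and an optional idiosyncratic override. */
final class UsageSketch {
    private final String meaning;
    private final VerbClassSketch verbClass;
    private final String objSemCatOverride;   // null when the entry lists nothing idiosyncratic

    UsageSketch(String meaning, VerbClassSketch verbClass, String objSemCatOverride) {
        this.meaning = meaning;
        this.verbClass = verbClass;
        this.objSemCatOverride = objSemCatOverride;
    }

    String meaning() { return meaning; }

    /** The lexical entry's own requirement wins; otherwise the verb class default applies. */
    String objSemCat() {
        return objSemCatOverride != null ? objSemCatOverride : verbClass.objSemCat();
    }

    /** Thematic roles are always taken from the verb class in this sketch. */
    String objThemRole() { return verbClass.objThemRole(); }
}
```

For example, new UsageSketch("WIN_GOAL", new GoalOrientedSketch(), "COMPETITION") reports COMPETITION as the object category, whereas a usage created with null as its last argument reports the class default ENTITY.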

6. Conclusion 

This paper has presented ongoing work on designing an object-oriented grammatical
model for Universal Grammar-based machine translation (UGBMT). This model benefits 
from recent development in object-oriented programming and linguistic research including 
development of WordNet and EuroWordNet. This model is composed of Java packages 
representing Universal Grammar and language-specific grammars. Linguistic entities of 
universal nature such as the lexicon of meanings, verb classes and semantic categories 
commonly found in individual languages are implemented as Java classes in 
universalgrammar. In addition, prototypical syntactic categories and syntactic verb classes 
such as sentence, noun phrase, intransitive verb, and transitive verb are implemented as 
classes in universalgrammar. Classes representing language-specific linguistic entities extend 

5 The super class Transitive contains methods setting and getting the values of variables such as objthemrole 
and objsemcat. As the class Transitive is a subclass of Verb, which has methods getting and setting the values of 
subjthemrole, subjsemcat and eventaspect, it inherits these methods from its super class Verb, and the class 
GoalOrientedVerbClass, in turn, inherits all these from its super class Transitive. 





classes representing prototypical linguistic constructs in universalgrammar, and inherit 
attributes and methods from the super classes, so that each of them does not have to hold the 
same information, but only pointers to it. The pointers in this model are the meanings stored in the 
lexical entries of individual lexicons, which also serve as the index of the Universal Lexicon. This 
architecture enhances simplicity and transparency in organization. 

Understanding of sentences is represented at Universal Structure via Surface Structure and 
Deep Structure. A source sentence is parsed into Surface Structure with syntactic categories 
and word forms, then into Deep Structure with meanings and functions of word forms, and 
finally into Universal Structure with richer semantic information. Generation of a 
corresponding sentence in a target language proceeds in reverse via Deep Structure and 
Surface Structure. What mediates the process from source language to target language is 
meaning. 

This model also takes advantage of the universal nature of functional categories such as heads, 
specifiers and complements that show a uniform order at each phrasal level. The architecture 
of this model not only gives rise to an intuitively transparent mechanism of translation, but 
also results in economy in programming. 

References 

[1] Barja, M. L., N. Paton, A. A. A. Fernandes, M. H. Williams and A. Dinn. 1994. An effective 
deductive object-oriented database through language integration. Proc. of the 20th VLDB 
Conference. 

[2] Chomsky, N. 1965. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA. 

[3] Climent, Salvador, Horacio Rodriguez and Julio Gonzalo. 1996. Definition of the links and 
subsets for nouns of the EuroWordNet Project. http://www.hum.uva.nl/~ewn/docs.htm 

[4] Fellbaum, Christiane. 1990. English verbs as a semantic net. International Journal of 
Lexicography 8: 281-303. 

[5] Fillmore, C. J. 1969. The case for case. In E. Bach and R.T. Harms (eds.), Universals in 
Linguistic Theory, 1-90. Rinehart & Winston, New York. 

[6] Greenberg, J. H. 1966. Universals of language, 2nd edn. MIT Press, Cambridge, MA. 

[7] Levin, Beth. 1993. English verb classes and alternations: A preliminary investigation. The 
University of Chicago, Chicago. 

[8] Li, L. and B. R. Bryant. 1998. An efficient parsing model for unification categorial grammar with 
object-oriented knowledge representation and selection sets. International Jr. on Artificial 
Intelligence Tools 7:143-162. 

[9] Miller, George A. 1998. Nouns in WordNet. In Christiane Fellbaum (ed.), WordNet: An Electronic 
Lexical Database, 23-46. The MIT Press, Cambridge, MA. 

[10] Miyoshi, H. and K. Furukawa. 1985. Ronri programming gengo ESP ni-okeru Object shikoo 
koobun kaiseki. In V. Dahl and P. Saint-Dizier (eds.), Shizengo rikai to ronri programming, 132-146. 
Kindaikagakusha, Tokyo. 

[11] Neuhaus, P. and U. Hahn. 1996. Restricted parallelism in object-oriented lexical parsing. 
COLING-96 Proceedings Vol. 1, 502-507. 

[12] Saint-Dizier, Patrick. 1994. Object-oriented logic programming for natural language processing. 
In P. Saint-Dizier (ed.), Advanced logic programming for language processing, 211-252. 
Academic Press, London. 

[13] Voorhees, Ellen M. 1998. Using WordNet for text retrieval. In Christiane Fellbaum (ed.), 
WordNet: An Electronic Lexical Database, 285-303. The MIT Press, Cambridge, MA. 

♦Yukiko Sasaki Alam is at Department of Digital Media Science, Hosei University, Tokyo, and 
Shahid Alam is at NextCard, USA. Contact e-mail address is sasaki@k.hosei.ac.jp. 





Semantic Based Text Mining 


D.Manjula 
P.Malliga 
T. V. Geetha 


ABSTRACT 

This paper discusses the incorporation of semantics in the various phases of text mining. Text 
mining is the process of extracting implicit, previously unknown and potentially useful 
information from textual documents. Domain concepts are also needed in information extraction 
because the information in document collections lies at the linguistic, domain-specific and 
application levels. WordNet is a lexical database that does not contain detailed domain knowledge. 
Domain knowledge is therefore introduced into the process of text mining to improve retrieval 
performance. Interlinked domain concept trees are to be created to hold the domain knowledge. 

Keywords 

Knowledge Discovery, Information extraction, Text mining, Natural Language Processing. 

Introduction 

Information extraction (IE) systems analyse unrestricted text in order to extract specific types of 
information from a language processing perspective. IE systems must operate at many levels, 
from word recognition to sentence analysis and from understanding at the sentence level on up to 
discourse analysis at the level of the full text documents. The goal of information extraction is to 
extract from the documents salient facts about prespecified types of events, entities or 
relationships. These facts are then usually entered automatically into a database, which may then 
be used to analyse the data for trends, to give a natural language summary or simply to serve for 
on-line access. Text mining is a current research area which includes Information retrieval, 
Information extraction, Text analysis, Text categorization, Text summarization, Text clustering, 
Database Technology and Visualization. 

Text Mining 

Two techniques of text mining are associative extraction and prototypical document mining. The 
associative extraction process does not produce exploitable results when full text is considered, 
so a different approach is necessary. Prototypical documents are informally defined as 
corresponding to information that occurs in a repetitive fashion in the document collection. 
The underlying working hypothesis is that repetitive document structures provide significant 
information about the textual base that is processed [1]. 

The various phases of text mining are as follows: 

1. Indexing the document collections. 

2. Preprocessing: this involves removal of tags, removal of stop words, word stemming, 
POS tagging and term extraction. 

3. Generating frequent term sets. Many of the available methods are primarily 
statistical in nature and focus on word frequency to determine the concepts within a document. 

4. Clustering the term sets based on a similarity measure derived from the number of 
common terms in the sets. 

5. Finally, the domain-specific module identifies information in a particular document associated 
with the obtained clusters. 
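
Steps 2 to 4 can be illustrated with a small sketch. It is only an assumption-laden simplification, not the system described in this paper: the stop-word list is a hypothetical placeholder, stemming and POS tagging are omitted, only term sets of size one are generated, and the similarity measure is simply the count of common terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TermSetSketch {
    // Hypothetical stop-word list; a real system would use a much larger one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "in", "is"));

    /** Counts in how many documents each non-stop-word term occurs. */
    static Map<String, Integer> documentFrequency(List<String> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token) && seen.add(token)) {
                    df.merge(token, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** Returns the terms that occur in at least minSupport documents. */
    static Set<String> frequentTerms(List<String> docs, int minSupport) {
        Set<String> frequent = new TreeSet<>();
        for (Map.Entry<String, Integer> e : documentFrequency(docs).entrySet()) {
            if (e.getValue() >= minSupport) {
                frequent.add(e.getKey());
            }
        }
        return frequent;
    }

    /** Similarity of two term sets, taken here as the number of common terms. */
    static int commonTerms(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("The price of oil", "Oil price rises", "Rain in Mysore");
        System.out.println(frequentTerms(docs, 2));
    }
}
```

A clustering step would then group term sets whose commonTerms value exceeds a chosen threshold; the threshold itself is a tuning decision that this paper does not specify.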

Incorporation of Semantics: Using WordNet 

The major problem with purely statistical methods is that they do not account for context. 
Specifically, finding the aboutness of a document relies largely on identifying and capturing the 
existence of not just duplicate terms, but related terms as well. This concept, known as cohesion, 
links semantically related terms and is an important component of a coherent text [2]. 
WordNet is a lexical resource which contains information about morphologically related 
words. WordNet is used as a tool for finding semantic similarities within and across 
documents [7]. The opposite approach is to attempt a semantic understanding of the source 
document. The problem with such approaches is that a detailed semantic representation must be 
created and a domain-specific knowledge base must be available [3,4]. 

Morris and Hirst first introduced the concept of lexical chains [5,6]. Lexical chains represent the 
lexical cohesion among an arbitrary number of related words, and are created by using 
WordNet's lexical knowledge. The use of semantics, on the other hand, can lead to a better 
representation of the document. In this paper we introduce the concept of chaining based 
on semantics, called semantic chains. Semantic chains can be created by identifying sets of words 
that are semantically related. Using semantic chains in text mining would be efficient because 
these relations are easily identifiable within the source text without the use of vast knowledge 
bases. The semantic chains are created with lexical and domain knowledge. WordNet does 
not provide complete domain knowledge: only relations such as synonymy, polysemy, 
antonymy, hypernymy, hyponymy and meronymy are available in it. Semantic chains are therefore 
created to overcome the Tennis Problem [8,9] and to link words that are conceptually 
related by context. To create semantic chains, we have to generate the domain knowledge and 
represent it as domain concepts. 
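
One way to picture a semantic chain is as a group of words from the text that are connected, directly or transitively, by lexical or domain relations. The sketch below makes that reading concrete; it is an assumption for illustration rather than the chaining procedure of this paper, and the relation pairs used in main are hypothetical sample data.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SemanticChainSketch {
    /** Builds an undirected relation store (word -> related words) from groups of word pairs. */
    static Map<String, Set<String>> relations(String[][]... groups) {
        Map<String, Set<String>> rel = new HashMap<>();
        for (String[][] pairs : groups) {
            for (String[] p : pairs) {
                rel.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
                rel.computeIfAbsent(p[1], k -> new HashSet<>()).add(p[0]);
            }
        }
        return rel;
    }

    /** Groups the given words into chains; linked words end up in the same chain. */
    static List<Set<String>> chains(List<String> words, Map<String, Set<String>> rel) {
        List<Set<String>> result = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String start : words) {
            if (!visited.add(start)) {
                continue; // already placed in an earlier chain
            }
            Set<String> chain = new TreeSet<>();
            Deque<String> todo = new ArrayDeque<>();
            todo.push(start);
            while (!todo.isEmpty()) {
                String word = todo.pop();
                chain.add(word);
                for (String other : rel.getOrDefault(word, Collections.<String>emptySet())) {
                    if (words.contains(other) && visited.add(other)) {
                        todo.push(other);
                    }
                }
            }
            result.add(chain);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] lexical = { { "doctor", "physician" } };          // e.g. synonymy
        String[][] domain = { { "racquet", "ball" }, { "ball", "net" } }; // tennis domain
        List<String> words = Arrays.asList("racquet", "ball", "net", "doctor");
        System.out.println(chains(words, relations(lexical)));         // four separate chains
        System.out.println(chains(words, relations(lexical, domain))); // racquet, ball, net chained
    }
}
```

With lexical relations alone, racquet, ball and net stay unconnected, which is the Tennis Problem; adding the domain pairs places them in one chain.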

WordNet as a Tool 

WordNet is a lexical database which captures all senses of a word and contains semantic 
information about the relations between words. The lexical chaining algorithm first segments the 
text; then, for each noun in the segment and for each sense of the noun, it attempts to merge these 
senses into all of the existing chains in every possible interpretation of the segment. Next the 
algorithm merges chains between segments that contain a word with the same sense. The algorithm then 
selects the chains denoted as 'strong' (more than two standard deviations above the mean) and 
uses these to generate a summary. The WordNet semantic hierarchy is a central resource for a 
variety of sense disambiguation algorithms. Lexical cohesion arising from semantic connections 
between words has been used successfully as the only form of textual cohesive structure, in the 
form of lexical chains. WordNet offers the possibility of discriminating word senses in documents, 
so that retrieval accuracy can be improved. The performance of word-based information approaches 
can be measured by introducing different rates of disambiguation errors in the collection. 
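
The 'strong chain' criterion mentioned above can be written out as a short sketch. How a chain is scored is not stated in this paper, so the scores are taken as given; the use of the population standard deviation (dividing by n) is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class StrongChainSketch {
    static double mean(double[] scores) {
        double sum = 0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    /** Population standard deviation (divides by n); an assumption of this sketch. */
    static double standardDeviation(double[] scores) {
        double m = mean(scores);
        double squares = 0;
        for (double s : scores) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / scores.length);
    }

    /** Returns the indices of chains scoring more than two standard deviations above the mean. */
    static List<Integer> strongChains(double[] scores) {
        List<Integer> strong = new ArrayList<>();
        if (scores.length == 0) {
            return strong;
        }
        double threshold = mean(scores) + 2 * standardDeviation(scores);
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > threshold) {
                strong.add(i);
            }
        }
        return strong;
    }

    public static void main(String[] args) {
        double[] scores = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 10 };
        System.out.println(strongChains(scores));
    }
}
```

For the sample scores the mean is 1.9 and the standard deviation is about 2.7, so the threshold is about 7.3 and only the last chain qualifies.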

Semantic-based indexing is to be done so that documents are indexed efficiently for information 
retrieval. The four indexing spaces are the original terms in the documents, the word senses 
corresponding to the document terms, the WordNet synsets corresponding to the document terms, and 
concept-based indexing. WordNet provides the chance of matching semantically related 
words. Beyond synonymy, WordNet can be used to measure the semantic distance between 
occurring terms to obtain more sophisticated ways of comparing documents and queries. Thus, 
indexing by synsets gets maximum matching and minimum spurious matching, seeming a good 




starting point to study text retrieval with WordNet. Indexing by word senses also improves 
the performance of text retrieval, although it prevents some matches that can be 
useful for retrieval. The classical vector space model for text retrieval is shown to give better 
results (up to 29% better in our experiments) if WordNet synsets are chosen as the indexing 
space, instead of word forms. The sensitivity of retrieval performance to disambiguation errors 
when indexing documents is also measured [10]. 
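
The effect of choosing synsets rather than word forms as the indexing space can be seen in a toy vector space example. The sketch assumes that sense disambiguation has already been done, so that each word maps to a single synset; the synset identifiers and the word-to-synset table are invented for illustration and are not WordNet data.

```java
import java.util.HashMap;
import java.util.Map;

final class SynsetIndexSketch {
    // Hypothetical word -> synset identifier table (not taken from WordNet).
    static final Map<String, String> SYNSET_OF = new HashMap<>();
    static {
        SYNSET_OF.put("car", "n-automobile");
        SYNSET_OF.put("automobile", "n-automobile");
        SYNSET_OF.put("price", "n-cost");
        SYNSET_OF.put("cost", "n-cost");
    }

    /** Builds a term-frequency vector over word forms or, if bySynset is true, over synsets. */
    static Map<String, Integer> vector(String text, boolean bySynset) {
        Map<String, Integer> v = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = bySynset ? SYNSET_OF.getOrDefault(token, token) : token;
            v.merge(key, 1, Integer::sum);
        }
        return v;
    }

    /** Cosine similarity of two sparse vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int x : b.values()) {
            normB += x * x;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        String query = "car price";
        String document = "automobile cost";
        System.out.println(cosine(vector(query, false), vector(document, false)));
        System.out.println(cosine(vector(query, true), vector(document, true)));
    }
}
```

Indexed by word forms, the query and the document share no term; indexed by synsets, they coincide. A wrong sense assignment would break such a match, which is why sensitivity to disambiguation errors matters.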

Domain specific Concept Model 

The information available in document collections is at three different levels: the lexical level, 
the domain-specific level, and the application-specific level. The incorporation of linguistic 
knowledge is a central theme. WordNet is a source of reusable knowledge that can be used during 
conceptual modeling to ensure that the resulting models are correct. One of the main reasons for 
incorporating linguistic knowledge into conceptual modeling is to make the use of the words 
appearing in the models consistent. The domain knowledge is not available in WordNet because 
it may not be valid in other domains. 

We can use WordNet as a source of reusable general information for building conceptual models, 
that is, as a reusable knowledge library. Lexical knowledge from WordNet is applicable in every 
universe of discourse, but usually the meaning of a word used in a certain universe of discourse is 
restricted. We view the reusable knowledge library as a multi-level repository based on the 
specificity, or refinement, of the concepts found at each level. One level stores the lexical 
knowledge, which is by and large static and unchanging but can be extended with new concepts or new 
insights. A second level stores domain-specific knowledge, consisting of concepts and relationships 
between the concepts which are typical for the domain but whose specific meaning cannot be 
found at the lexical knowledge level. 

The domain-specific knowledge level should be built from scratch, but lexical knowledge 
serves as a good starting point because domain-specific knowledge will be derived from it. 
To build a model of a certain universe of discourse, one must analyze the documents and elicit the 
information to be put in the COLOR-X models with objects, user-defined relationships and standard 
(or conceptual) relationships. These can be treated separately to show exactly what kind of WordNet 
information is used during the creation of conceptual models. 

Because of the ambiguous nature of natural language, words can have several meanings and 
many concepts are denoted by two or more words. With the help of WordNet we try to find the 
right word meaning in each specific universe of discourse. The more general information about 
the word is taken from WordNet using WordNet’s substance, part, and member meronymy 
relationships. The part and member meronymy information will be translated into aggregation 
relationships, and the substance meronymy information into class attributes. 
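
The translation rule in the last sentence can be stated compactly in code. The enum and method names below are hypothetical and only restate that rule; they are not part of WordNet or of any modeling tool.

```java
final class MeronymyMappingSketch {
    /** The three kinds of meronymy distinguished in WordNet, as named in the text. */
    enum Meronymy { PART, MEMBER, SUBSTANCE }

    /** The conceptual-model constructs the text translates them into. */
    enum ModelConstruct { AGGREGATION, CLASS_ATTRIBUTE }

    static ModelConstruct translate(Meronymy kind) {
        switch (kind) {
            case PART:
            case MEMBER:
                return ModelConstruct.AGGREGATION;     // part and member meronymy
            case SUBSTANCE:
                return ModelConstruct.CLASS_ATTRIBUTE; // substance meronymy
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        for (Meronymy kind : Meronymy.values()) {
            System.out.println(kind + " -> " + translate(kind));
        }
    }
}
```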


Like the meaning of an object, the meaning of a certain relationship, denoted by a verb, can be 
ambiguous. The first step is to choose the right meaning. The next step is to choose the structure 
or selectional restrictions of the verb. The number of objects involved in a relationship is stored in 
WordNet in the form of verb frames. 

The following information retrieved from WordNet points to a standard (conceptual) relationship. 

Hyponymy/hypernymy: specialization/generalization (IS-A) 

Part and substance meronymy : aggregation (HAS-A) 





The possession and instantiation relationships are domain specific, and information about them is 
not available in WordNet. When we extract knowledge from the documents, we need 
the rich concepts in the documents and their connectivity through a small set of relation types. 
The defining features of a concept provide the context for that concept. We can take the defining 
features of a word (other concepts) from its WordNet gloss, and the semantic connections between 
the concepts from WordNet. When a concept belongs to a hierarchy, it inherits the properties of its 
hypernym. A concept also inherits properties through some of its defining feature concepts. The 
idea is to transform each synset's gloss into a directed tree of defining features, with synsets as 
nodes and lexical relations as links. Finally we get the knowledge base for the particular domain as 
interlinked domain trees. WordNet and this knowledge are taken as the basis for semantically 
indexing the documents, creating semantic chains, generating semantic frequent keywords and 
semantic tagging. 

Incorporation of Semantics in Text Mining 


As manual indexing is a very time-consuming task, automated indexing of the textual 
documents has to be performed in the preprocessing phase of text mining. The SMART 
Information Retrieval System is one of the techniques for indexing documents using 
frequency-based weighting schemes. Indexing can also be done by the word senses corresponding 
to the document terms, by the WordNet synsets corresponding to the document terms, and by 
the concepts taken from the domain concept tree. This takes its knowledge from WordNet, 
so this kind of text retrieval is called WordNet-based text retrieval. The problems of WordNet 
for text retrieval include overly fine-grained sense distinctions and a lack of domain 
information. In this paper the domain information is taken from the domain concept tree. The 
performance of text retrieval by all three methods is obtained by measuring the accuracy of 
document retrieval. 

Tagging of the documents is part of the preprocessing phase of text mining. The first step in 
preprocessing is tokenization. The tokenizer joins independent words to form a collection by 
replacing intervening spaces, then appending a syntactic tag. Adjacent proper nouns are also joined 
and assigned the syntactic tag NNP. Brill's part-of-speech tagger [11] is used to automatically 
assign syntactic tags to the document collections. Although the accuracy of the syntactic tags is 
not perfect, their presence speeds up sense assignment by displaying only the set of WordNet senses 
for a word with the analysed part of speech. The selected senses are stored with the text, 
representing links between it and the WordNet database. These links are called semantic tags. The 
performance of text mining is improved by tagging the document collections semantically. 
Semantic tagging should be viewed as more than disambiguation between senses. It 
leads to a new type of tagging which accounts for the senses of the 
word rather than the word itself. This is supported through a design based on systematic 
polysemous classes and a class-based acquisition of lexical knowledge for specific domains. 
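
The joining of adjacent proper nouns can be sketched as follows. The word/TAG input format and the underscore used to replace the intervening spaces are assumptions of this sketch; the paper does not state which separator the tokenizer uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProperNounJoiner {
    /**
     * Joins runs of adjacent tokens tagged NNP into a single NNP token.
     * Every input item is assumed to have the form word/TAG.
     */
    static List<String> joinProperNouns(List<String> tagged) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String item : tagged) {
            int slash = item.lastIndexOf('/');
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (tag.equals("NNP")) {
                if (run.length() > 0) {
                    run.append('_'); // assumed separator
                }
                run.append(word);
            } else {
                flush(run, out);
                out.add(item);
            }
        }
        flush(run, out);
        return out;
    }

    /** Emits the pending proper-noun run, if any, and clears it. */
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run + "/NNP");
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList(
                "Udaya/NNP", "Narayana/NNP", "Singh/NNP", "directs/VBZ", "CIIL/NNP");
        System.out.println(joinProperNouns(tagged));
    }
}
```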

After the preprocessing steps, the semantic indexing structure is obtained for the documents and 
serves as a basis for the frequent term set generation. Text mining techniques can be 
extended to a distributed environment [12], where the basis of indexing is now semantic. 
The next step is to cluster the term sets based on a similarity measure derived from the number of 
common terms in the sets. WordNet knowledge is used for deriving the semantic measure 
between the common terms in the set, but this does not take the domain concept into account. The 
domain concept tree is therefore also used for finding the semantic measure. 





Table 1: Lexical relations 

Nouns 

Relations            Subtypes                          Example 
Synonymy                                               Puttakam 'book' to nduul 'book' 
Hypernymy-Hyponymy                                     Vilangku 'animal' to paaluuTTi 'mammal' 
Hyponymy-Hypernymy                                     Pacu 'cow' to paaluuTTi 'mammal' 
Holonymy-Meronymy    Wholes to parts                   Meecai 'table' to kaal 'leg' 
                     Groups to members                 TuRai 'department' to peeraaciriyar 'professor' 
Meronymy-Holonymy    Parts to wholes                   Cakkaram 'wheel' to vaNTi 'cart' 
                     Members to groups                 PaTaittalaivar 'captain' to paTai 'army' 
Opposites            Antonymic (gradable)              Ndallavan 'good person' to keTTavan 'bad person' 
                     Complementary                     aaN 'male' to peN 'female' 
                     Privative (opposing features)     AhRiNai 'irrational' to uyartiNai 'rational' 
                     Equipollent (positive features)   aaN 'male' to peN 'female' 
                     Reciprocal social roles           Vaittiyar 'doctor' to ndooyaaLi 'patient' 
                     Kinship relations                 Ammaa 'mother' to makaL 'daughter' 
                     Temporal relations                Munnar 'before' to pinnar 'after' 
Multiple opposites   Orthogonal or perpendicular       VaTakku 'north' to kizakku 'east' and meeRku 'west' 
                     Antipodal opposition              VaTakku 'north' to teRku 'south' 
                     Serial                            OnRu 'one', iraNTu 'two', muunRu 'three', ndaanku 'four' 
                     Cycle                             NjaayiRu 'Sunday' to tingkaL 'Monday' ... to cani 'Saturday' 
Lexical association  Collocation                       Cingkam 'lion' to karji 'roar' 
                     Morphological relations           PaTi 'study' to paTittavan 'educated man' 
Compatibility                                          Ndaay 'dog' to cellappiraaNi 'pet' 


Verbs 

Relations            Definition/subtypes                Example 
Synonymy             Replaceable events                 tuungku 'sleep' uRangku 'sleep' 
Meronymy-hypernymy   Events to super-ordinate events    paRa 'fly' pirayaaNi 'travel' 
Troponymy            Events to their subtypes           ndaTa 'walk' ndoNTu 'limp' 
Entailment           Events to the events they entail   kuRaTTaiviTu 'snore' tuungku 'sleep' 
                     Event to its cause                 uyar 'rise' uyarttu 'raise' 
                     Event to its presupposed event     vel 'succeed' muyal 'try' 
                     Event to its implied event         kol 'murder' iRa 'die' 
Antonym              Opposites                          kuuTTu 'increase' kuRai 'decrease' 
                     Converseness                       vil 'sell' vaangku 'buy' 
                     Directional opposites              puRappaTu 'start' vandtuceer 'reach' 


2.3. Other Lexical Databases 

The following relations with other databases will be captured: 

• Tamil senses of the T-Wordnet will be correlated to Tamil senses in the Madras University 
lexicon as well as English senses in E-Wordnet. 

• Collocational information 1 will be derived semi-automatically from the CIIL corpus. 

• Verb frames and sample sentences for each sense will be provided by the TransLexGram project. 

1 This is a project to collect the collocational information for polysemous & homonymous words from the CIIL corpus 

















Tamil WordNet 


S. Rajendran, S. Arulmozi 
B. Kumara Shanmugam 
S. Baskaran, S. Thiagarajan 


Abstract 

A WordNet plays an important role both in the development of NLP applications, such as Machine 
Translation and Question-Answering systems, and in lexical studies of a language. 
While WordNets have been compiled for most of the European languages, these resources do not exist 
for Indian languages. This paper presents the lexicographic and computational issues faced in an 
attempt to build a 'Tamil WordNet'. A working model will be ready at the time of presenting the 
paper. 

1. Introduction 


WordNet links words and concepts through a variety of semantic relations based on similarity and 
contrast. This project aims to build an interrelated lexicon in Tamil with the scope of the English 
Wordnet (E-WordNet). The design and implementation of Tamil Wordnet (T-WordNet) is similar to 
that of E-Wordnet with some added features. In an attempt to model the lexical knowledge of a native 
speaker of Tamil, detailed componential analysis was done to capture the meaning components of 
each lexical item. 

2. Linguistic Work 

The lexical items in Tamil can be broadly classified into twelve semantic domains. Hence this work 
naturally decomposes into two parts: (1) establishing these semantic domains and (2) fixing the 
relations between these domains. 

2.1. Lexical field analysis 
The four sub-aims of this part are: 

1. Establishing semantic domains and sub-domains. While broadly these will correspond to those in 
the English WordNet, there are numerous minor differences. The domains will be roughly mapped 
onto the E-WordNet, as has been done in various EuroWordNets (e.g., Czech). Examples of such 
semantic domains include people, animal, plant, etc. 

2. Grouping lexical items in terms of semantic domains or fields. This corresponds roughly to 
synsets at the coarsest level. 

3. Refining the coarse synsets in (2). This corresponds to finding the superordinate terms for smaller 
sets of lexical items, i.e., establishing "titles" for sub-domains. 

4. Identifying the fine grain componential features that differentiate one lexical item from another. 
This maps to the fine-grained synset concept. 

The work is based on extensive preliminary investigations of the vocabulary of Tamil based on lexical 
field analysis (Rajendran: 1978, 1983, 2001). Portions of this work have been compiled into a Tamil 
Thesaurus (Rajendran, 2001). 

2.2. Lexical Relations 

The familiar semantic relations (antonymy etc.) will be manually encoded by lexicographers. We will 
restrict ourselves to verbs and nouns in the first stage of the project. For verbs and nouns, the broad 
relations in the E-Wordnet (Beckwith et al., 1993) will be carried over unchanged (for nouns, a 
compatibility relation has been added). Finer sub-relations between senses will also be encoded. 
Examples are given in Table 1. 



Conclusions 


The paper discusses the domain-specific knowledge to be interfaced with WordNet to improve the 
performance of information extraction. The interlinked domain concept trees are generated for 
domain-specific knowledge by identifying concepts and relationships and also by taking the gloss for 
each concept. The performance of text retrieval is improved by semantic indexing of the 
document collections. This process takes the lexical knowledge from WordNet and the domain 
knowledge from the domain concept tree. Semantic chains are also created for finding the semantic 
similarity measure between the common terms generated in the process of text mining of 
prototypical documents, which is now also based on conceptual relations determined by context. 

References 

[1] M. Rajman, R. Besancon. (1997) "Text Mining: Natural Language techniques and Text Mining 
applications" IFIP 1997. raiman@lia.di.epfl.ch 

[2] Kjersti Aas and Line Eikvil. (1999) "Text Categorisation: A survey", Norwegian Computing 
Center, Norway, June 1999. 

[3] Luhn, H.P. (1968) "The automatic creation of literature abstracts". In H.P. Luhn: Pioneer of 
Information Science, Schultz education spectrum 1968. 

[4] Halliday, Michael and Ruqaiya Hasan. (1976) "Cohesion in English", Longman, London 1976. 

[5] Gregory Silber, Kathleen F. McCoy. "Efficient Text Summarization using Lexical chains", 
Proceedings of the 2000 International Conference on Intelligent User Interfaces, 2000, pages 252-255. 

[6] Morris, J. and G. Hirst. "Lexical Cohesion computed by thesaural relation as an indicator of the 
structure of the text". In Computational Linguistics 18(1), pp. 21-45. 

[7] Stephen J. Green. (1999) "Building hypertext links by computing semantic similarity" IEEE 1999. 

[8] Miller, G.A. (1993) "The association of ideas". The General Psychologist, 29, 69-74. 

[9] Miller, G.A. and Charles, W.G. (1991) "Contextual correlates of semantic similarity". Language 
and Cognitive Processes, 6, 1-28. 

[10] Julio Gonzalo and Felisa Verdejo. "Indexing with WordNet synsets can improve text retrieval". 

[11] Eric Brill. (1994) "Some Advances in Transformation-Based Part-of-Speech Tagging". In 
Proceedings of the Twelfth National Conference on Artificial Intelligence, Volume 1, pages 722-727. 

[12] D. Manjula and T.V. Geetha. (2001) "Message Optimization using polling for Distributed Text 
Mining" NCDAR, Mandya, 2001. 





A morphological analyzer for Tamil will also be incorporated into the user interface to the T-Wordnet 
retrieval system. 

3. Unique feature of T-WordNet 

WordNet focuses on the semantics of words and concepts rather than on semantics at the text or 
discourse level. So, the E-WordNet contains no relations that indicate the words' shared membership 
in a topic of discourse. Roger Chaffin's "Tennis problem" is a classic example of this. WordNet does 
not link racquet, ball and net in a way that would show that they all belong to a court game. The 
T-WordNet will capture such relations between words. For example, the words maruttuvar 
'physician', ndooyaaLi 'patient' and maruttuvamanai 'hospital' will be connected, and so will 
maTTai 'racquet', pandtu 'ball' and valai 'net'. 

4. Computational Work 

4.1. Internal Representation 

Wordforms and senses will be given unique indices. Glosses will be provided for every sense entry. 
The mapping from wordform index to sense index will be represented in tabular format in a relational 
database. Synsets are naturally encapsulated in this representation. Thus all wordform entries 
corresponding to a particular sense entry will perforce define a synset. 

Certain senses will not correspond directly to any particular word. Such senses will represent the 
major and minor semantic domains and super-ordinate terms, and will primarily stand for technical 
expressions. 


Relations, both binary and hierarchical, will again be represented in a tabular format. 
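
The remark that synsets are naturally encapsulated in the wordform-to-sense table can be illustrated with a small sketch. The in-memory array stands in for the relational table, and the index values are invented for illustration; this is not the T-WordNet schema.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class SenseTableSketch {
    /**
     * Groups rows of the form {wordformIndex, senseIndex} by sense index.
     * All wordforms mapped to the same sense form that sense's synset.
     */
    static Map<Integer, SortedSet<Integer>> synsets(int[][] wordformSenseRows) {
        Map<Integer, SortedSet<Integer>> bySense = new TreeMap<>();
        for (int[] row : wordformSenseRows) {
            bySense.computeIfAbsent(row[1], k -> new TreeSet<>()).add(row[0]);
        }
        return bySense;
    }

    public static void main(String[] args) {
        // Hypothetical indices: wordforms 1 and 2 share sense 100, wordform 3 has sense 200.
        int[][] rows = { { 1, 100 }, { 2, 100 }, { 3, 200 } };
        System.out.println(synsets(rows));
    }
}
```

In a relational database the same grouping would be a query over the mapping table by sense index, so no separate synset table is needed.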

4.2. Input Interface 

In the E-WordNet approach, lexicographic information is entered in separate files, which are then 
cross-checked and compiled into the internal format. We have opted instead to write a graphical 
interface, which will enable the lexicographers to directly manipulate the database. This interface can 
be used to enter new information as well as to manipulate existing entries. After each modification, the 
consistency of the database will be cross-checked by the equivalent of E-WordNet's grinder module 
(1st pass). 

Correlating new information with existing entries and structures presents one of the main 
bottlenecks in modifying the database. The interface will allow (1) extensive regexp-based searches 
of both senses and wordforms, (2) visual display of existing hierarchies as trees and (3) modification 
of hierarchies by a point-and-click approach. 
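
A regexp-based search over wordforms, as in point (1), might look like the following sketch. It uses the standard java.util.regex package; the class and method names are hypothetical, and the sample wordforms are the transliterated words cited in Section 3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;

final class RegexSearchSketch {
    /** Returns the wordforms in which the regular expression matches somewhere. */
    static List<String> search(Collection<String> wordforms, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String wordform : wordforms) {
            if (pattern.matcher(wordform).find()) {
                hits.add(wordform);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> wordforms = Arrays.asList("maruttuvar", "ndooyaaLi", "maTTai", "pandtu", "valai");
        // All wordforms ending in "ai".
        System.out.println(search(wordforms, "ai$"));
    }
}
```

The same routine could be applied to glosses to search senses; matching is case-sensitive here, which matters because capital letters are distinctive in the transliteration used in this paper.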

4.3. Other Databases 

Collocational calculations for word-pairs and triplets will provide the information for 
semi-automating the process of establishing relations between other databases and T-WordNet. 

1. The same user interface will allow sense mapping of Tamil senses to the E-Wordnet senses. 

2. The lexical relations in the T-Wordnet will be used as seed words for extracting common 
collocation information (e.g., word pair frequencies) for all words in a synset. We will use the 
CIIL corpus as well as corpora from Web sources. This information will be used to locate the 
appropriate sense in the Madras University lexicon. 


2 The Tamil TransLexGram project is a collection of English to Tamil word mappings. Sample (translated) sentences in both 
languages are written for each separate sense. We are also writing frames for each verb. The pilot study will have 5000 
words and about 10000-15000 senses. 




3. Mapping of corresponding senses in T-Wordnet and TransLexGram will first be made 
collocationally as in (2) and subsequently checked manually by lexicographers. 

4.4. Morphological Analyzer 

As Tamil is a morphologically rich language, it is impractical to store all the inflected wordforms in 
the database. Hence the database will contain only the root form of the words. The morphological 
analyzer will extract the root form and the grammatical features from a given word, which will be 
then used to search the database. The morphological analyzer will also be used to find the derivations 
of a particular word (for defining lexical associations; see Table 1). 

5. User Interface to T-WordNet 

This will be identical to the E-Wordnet interface, but will have the capability to display additional 
information from related databases in the style of DICT (www.dict.org). 

Acknowledgements 

Central Institute of Indian Languages (CIIL), Mysore. 

Dr. K Subbiah Pillai, Dr. S Renganathan, International Institute of Tamil Studies, Chennai. 

S V Ramanan, The AU-KBC Research Centre, Chennai 

References 

Baker C F, Fillmore C J and Lowe J B. The FrameNet Project. 
http://www.icsi.berkelev.edu/~framenet/ 

Fellbaum, Christiane. (1990) English verbs as a semantic net. In: International Journal of 
Lexicography 3 (4), pp. 278-301. 

—. (ed.).(1998) WordNet: An Electronic Lexical Database. Cambridge, Massachusetts: The MIT 
Press. 

Gross, Derek and Katherine J. Miller. (1990) Adjectives in WordNet. In: International Journal of 
Lexicography 3 (4), pp, 265-277. •/; 

Miller, George A. (1990) Nouns in WordNet:A Lexical Inheritance System. In: International Journal 
of Lexicography 3 (4), pp. 245-264. 

Miller G A, Beckwith R, Fellbaum C, Gross D and Miller K. (1990) Introduction to WordNet: An On-line
Lexical Database. In: International Journal of Lexicography 3 (4), pp. 235-244.

Miller George A. (1991) Science of Words. New York: Scientific American Library. 

Nida E A. (1975) Componential Analysis of Meaning: An Introduction to Semantic Structure. The
Hague: Mouton.

—. (1975) Exploring Semantic Structure. The Hague: Mouton.

Rajendran, S. (1978) Syntax and Semantics of Tamil Verbs. Ph.D. Thesis, University of Poona. 

—. (1983) Semantics of Tamil Vocabulary, Report of the UGC sponsored postdoctoral work 
(manuscript). Poona: Deccan College Post-Doctoral Research Institute. 

—. (1997) Intricacies involved in the preparation of lexical net for Tamil. DLA conference, Telugu
University, Hyderabad. 

—. (2001) taRkalat tamizc coRkaLanjciyam [Thesaurus for Modern Tamil]. Thanjavur: Tamil University.

—. (2001) Preliminaries to the preparation of a WordNet for Tamil. National seminar on 
Computational Linguistics and Dravidian Languages, Annamalai University. 

S Rajendran, Tamil University, Thanjavur 

S. Arulmozi, AU-KBC Research Centre, Anna University, Chennai. 

B. Kumara Shanmugam, AU-KBC Research Centre, Anna University, Chennai. 

S. Baskaran, AU-KBC Research Centre, Anna University, Chennai. 

S. Thiagarajan, AU-KBC Research Centre, Anna University, Chennai. 





Oriya WordNet 


S. Mohanty
R. C. Balabantaray
P. K. Santi


Abstract 

Machine Translation (MT) for the Oriya language is in its infancy. The non-availability of a proper
electronic dictionary has hampered our efforts to tackle the MT problem. This inspired us to develop a
WordNet for the Oriya language. We have tried to design the WordNet taking into account the special
features of this language. In this WordNet the behaviour of each word and its category are explained.

Introduction 

The world is passing through the Information Revolution, in which information and knowledge are
available to people in unprecedented amounts, whenever and wherever they are needed. Those
communities that fail to take advantage of this new technology may lag behind. The revolution is
based on two technologies: computers and communication.

India has the second largest population in the world, about one billion people. There are 18
constitutional languages with 10 scripts and over 1650 dialects. Orissa is a state in the eastern region
of India, with a population of 31.6 million according to the 1991 census. Oriya is the official as
well as the spoken language of Orissa, and is one of the constitutional languages of India.

The people of Orissa can take full advantage of the information revolution if access to information
technology can be provided to them at low cost and in their native language (Oriya). This means the
development and use of technology related to the Oriya language, as well as the creation of digital
content in Oriya. This simple idea has inspired us to develop the WordNet, which we hope will help us
achieve effective Machine Translation.

WordNet is an online lexical database that holds information about the network of word senses. It
combines traditional lexicographic information with modern high-speed computation, organized in a
semantic network structure.

Section 1: Preview on WordNet 

Work has already been initiated on the Universal Networking Language (UNL), which incorporates all
major languages of the world, including the Indian national language Hindi. In India we have
18 official languages, which need networking for translation as they are mostly of
common origin. We have made a humble attempt to develop a WordNet for the Oriya language in the
hope of networking it with other Indian languages for efficient Machine Translation from one Indian
language to another. Keeping this in view, we have tried to design the WordNet so that it
includes some of the pan-Indian words too.

Section 2 describes the grammatical basis of the selected words. Section 3 gives the
table descriptions of the words in their different categories. Finally, in our conclusion we
outline the next course of work following on from the present efforts.

Section 2: Oriya Grammar Analysis

The structure of sentences in Oriya, as in other Indian languages, differs from that of
English. English sentence structure is generally ‘Subject Verb Object’ (SVO), whereas the
corresponding structure in Oriya (and in the other Indian languages) is ‘Subject
Object Verb’ (SOV). As in other languages, words in Oriya fall into six categories:

(i) Bisesya (Noun) 

(ii) KriyA (Verb) 

(iii) BisesaNa (Adjective) 

(iv) Sarba nama (Pronoun) 

(v) Abyaya (Preposition) 

(vi) KriyA bisesaNa (Adverb) 

In Oriya some words are used as Oriya words although they have been imported from other
languages such as Sanskrit, Arabi, Parsi, Portuguese etc.

For example:

Tiket (Ticket) is of English origin. 

Agni (Fire) is of Sanskrit origin. 

KAidA (Technique) is of Arabi origin. 

Behos (Senseless) is of Parsi origin. 

PadrI (Christian Missionary) is of Portuguese origin.

These words defined above are called bAYdeShika (Foreign) words. 

In Oriya the bisesya (Noun) has various subcategories, such as:

(i) Byakti bAchaka bisesya (Proper Noun) 

Rama, Hari 

(ii) JAti bAchaka bisesya (Common Noun) 

Parbata (mountain), manuShya (man)

(iii) Guna bAchaka bisesya (Abstract Noun) 

SoUndharya (beauty), namratA (politeness) 

(iv) Bastu bAchaka bisesya (Noun of material) 

Pathara (stone), sunA (gold)

(v) KriyA bAchaka bisesya (verb used as Noun) 

Sayana (sleep), gamana (go) 

Sayana manare sAnti Ane. (Sleeping gives pleasure to mind.) 

Similarly the verbs can be classified into the following categories: 

SamApikA (shoibA, sleep): Shishuti shoichhi (The child is sleeping.) 

AsamApikA (khAi, eat): Se ethAru khAi chAligalA (He has left this place after eating.) 

Each word may carry further information such as gender, person and tense, which is
essential for MT. Although information such as Synonymy, Antonymy, Hypernymy and Meronymy
is not given in the dictionary, it is essential when dealing with MT.

Synonymy 

Two expressions are synonymous if the substitution of one for the other never changes the meaning of
the sentence in which the substitution is made.

Ex: The substitution of bahi (book) for pustaka (book) will not alter the meaning of a sentence.

Antonymy

The antonym of a word is its negation, i.e. a word with the opposite meaning; an antonym does
not always exist.





Ex: Bhala (good) and manda (bad) are antonymous. 

Meronymy

This is a kind of semantic relation which represents the Part-Whole (HAS-A) relationship.

Ex: The human body has an eye, i.e. the eye is a part of the body (which is treated as the whole).

Hypernymy 

Hypernymy is a semantic relationship which represents supersethood.

Ex: Gachha (tree) is a superset (hypernym) of bara gachha (banyan tree).

Entailment 

Entailment refers to a relationship between two verbs. A verb P entails Q if the truth of Q follows
logically from the truth of P.

Ex: DauDibA (run) entails chAlibA (walk) but the converse is not true (i.e. this occurs only in one
direction.) 
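
The relations defined in this section can be sketched as a small in-memory store. This is only an illustration under our own assumptions: the class name RelationStore, its methods and the relation labels are hypothetical and are not part of the Oriya WordNet design. It shows that synonymy and antonymy are recorded in both directions, that hypernymy and meronymy have named inverses, and that entailment is recorded in one direction only.

```python
from collections import defaultdict


class RelationStore:
    """Toy in-memory store for lexical relations (hypothetical helper)."""

    # Relations that hold in both directions once one direction is asserted.
    SYMMETRIC = {"synonym", "antonym"}
    # Directed relations mapped to the name of their inverse relation.
    INVERSE = {"hypernym": "hyponym", "meronym": "holonym"}

    def __init__(self):
        self._links = defaultdict(set)  # (word, relation) -> set of related words

    def add(self, source, relation, target):
        """Record that `target` stands in `relation` to `source`."""
        self._links[(source, relation)].add(target)
        if relation in self.SYMMETRIC:
            self._links[(target, relation)].add(source)
        elif relation in self.INVERSE:
            self._links[(target, self.INVERSE[relation])].add(source)
        # Any other relation (here "entails") is stored in one direction only.

    def related(self, word, relation):
        """Return the words linked to `word` by `relation`, sorted."""
        return sorted(self._links.get((word, relation), ()))


store = RelationStore()
store.add("bahi", "synonym", "pustaka")
store.add("bhala", "antonym", "manda")
store.add("bara gachha", "hypernym", "gachha")  # gachha is the hypernym
store.add("body", "meronym", "eye")             # eye is a part of body
store.add("DauDibA", "entails", "chAlibA")

print(store.related("pustaka", "synonym"))   # ['bahi']
print(store.related("gachha", "hyponym"))    # ['bara gachha']
print(store.related("chAlibA", "entails"))   # []
```

The last query returns an empty list because entailment, unlike synonymy, is not mirrored when it is added.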

Polysemy

In the polysemy group, each word is associated with a unique number, to maintain proper references
between words. This index is defined manually according to the position of the word in the dictionary.

Unq_ID: This Unq_ID is a unique number used to access the word from the database quickly (or just to
transfer the pointer to a particular position).

Section 3: Design Part of Oriya WordNet 

The design of the Oriya WordNet is based on the basic relations, or lexical links, between
the words. The following set of tables is designed to store the lexical entries. The Unique_ID and
the semantic links between them are grouped according to their different grammatical categories
(noun, adjective, verb, adverb, pronoun etc.) and are stored in separate tables.

The main tables and their fields are:

Word_table, Noun_table, Verb_table, Adjective_table, Adverb_table.


Word table    Noun table     Verb table     Adverb table    Adjective table
----------    -----------    -----------    ------------    ---------------
Root word     Unique ID      Unique ID      Unique ID       Unique ID
              Noun cat                      Adverb cat      Adjective cat
Word cat      Synonymy       Synonymy
Comment                      Entailment
              Hypernymy      Tense          Description     Description
              Meronymy       Description
              Description

Table 1: Tabular Description of the Database


Table 1 shows the storage structure of the tables along with their respective fields in the database.

During data entry for the word bahi (book), the following information is entered in
Word_table (Table 2) and Noun_table (Table 3).
















Field        Entry
---------    --------------------------------------------------
Root word    Bahi (book)
Polysemy     Bahi 3849.06, pustaka 36.084813, grantha 16.193532
Word_cat     Bisesya (noun)
Comment      Desaja (Original word)

Table 2: Word_table


Table-2 contains the entries in the Word_table. 



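Field           Entry
------------    --------------------------------------
[Unique ID]     3849.06
Noun cat        Bastu bAchaka
[Synonymy]      Pustaka, grantha, hisAba khAtA
[label lost]    PustakAgAra (library)
Description     Likhita rachanAbaLira ekatra prakAsana

Table 3: Noun_table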


Table-3 contains the entries of Noun_table. 
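
The two tables can be sketched with an in-memory SQLite database. This is a hedged illustration and not the authors' implementation: the paper does not name a database system, the column names are our own, and we assume for simplicity that Word_table carries the unique number of the root word directly (in Table 2 that number appears inside the Polysemy field). The values are those given for bahi in Tables 2 and 3.

```python
import sqlite3

# In-memory database; the schema below is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE word_table (
        root_word TEXT PRIMARY KEY,
        unique_id TEXT,      -- assumption: key linking to the category table
        word_cat  TEXT,
        comment   TEXT
    );
    CREATE TABLE noun_table (
        unique_id   TEXT PRIMARY KEY,
        noun_cat    TEXT,
        synonymy    TEXT,
        description TEXT
    );
    """
)

# IDs are kept as text so that "3849.06" is stored exactly as printed.
conn.execute(
    "INSERT INTO word_table VALUES (?, ?, ?, ?)",
    ("bahi", "3849.06", "Bisesya", "Desaja"),
)
conn.execute(
    "INSERT INTO noun_table VALUES (?, ?, ?, ?)",
    (
        "3849.06",
        "Bastu bAchaka",
        "pustaka, grantha, hisAba khAtA",
        "Likhita rachanAbaLira ekatra prakAsana",
    ),
)


def lookup_noun(connection, root_word):
    """Follow the unique ID from word_table to noun_table for one root word."""
    return connection.execute(
        "SELECT w.root_word, w.word_cat, n.noun_cat, n.synonymy, n.description "
        "FROM word_table AS w JOIN noun_table AS n ON n.unique_id = w.unique_id "
        "WHERE w.root_word = ?",
        (root_word,),
    ).fetchone()


print(lookup_noun(conn, "bahi"))
```

The join on the unique ID plays the role of the pointer that moves from the word entry to its detailed noun entry.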

The other tables (Verb_table, Adverb_table, Adjective_table) are being constructed in this manner.

Conclusions

The above tabular description gives an example of how a word is described in the WordNet. It shows
that the WordNet is not just a dictionary or lexicon: it holds additional information about a word
(Synonymy, Antonymy, Meronymy, Holonymy etc.), and that information can be accessed very
quickly. The WordNet is actually a word sense network. In the WordNet one has to store a
Unique_ID against each word for reference purposes, so that the pointer can quickly move to a
particular position to access the detailed information about the required word. This concept can be
used towards solving the problem of word sense disambiguation more exactly.

Acknowledgement 

This research is supported by a grant from Ministry of Information Technology, Government of India, 
New Delhi. 

References 

Fellbaum C. (1999) WordNet: An Electronic Lexical Database, The MIT Press, Cambridge,
Massachusetts, London, England. 

Jha S.K, Narayan D.K, P. Prabhakar & B. Puspak, Proceedings of the Workshop on Lexical Resources 
for Natural Language Processing, 5-8 January 2001, Hyderabad, P-8. 

Kar Pandit K.C (2000), TaruNa shabdkoSh, Grantha Mandira, Cuttack. 

Kar R. (1999), Abhinaba sAralA utkaLa abhidhAna , Orissa Book Emporium, Cuttack. 

Mohapatra Pandit N. and Dash S., SarbasAra byAkaraNa, New Students Store, Cuttack. 

S. Mohanty, Deptt. of Comp. Sc. & Appl., Utkal University, Bhubaneswar-751004, INDIA, sanqham1@rediffmail.com
R. C. Balabantaray, Deptt. of Comp. Sc. & Appl., Utkal University, Bhubaneswar-751004, INDIA
P. K. Santi, Deptt. of Comp. Sc. & Appl., Utkal University, Bhubaneswar-751004, INDIA

















Indigenous Knowledge Systems in the Global WordNet: 
Focus on Car Nicobarese 


R. Elangaiyan 


Introduction 

There have been serious and successful attempts at placing English and other European languages in the
Global WordNet. There is also ongoing work to put other modern literary languages in the
WordNet (such as the modern Indian languages Hindi, Tamil, Bengali etc.). But nothing has been
heard (at least to the knowledge of this author) about developing a WordNet for lesser known languages
(especially tribal languages) which are spoken by indigenous peoples practicing distinct cultures and
knowledge systems. Any attempt to develop a WordNet for a lesser known language, for instance the
Car Nicobarese language (spoken by the Nicobarese tribesmen, the inhabitants of the Car Nicobar island
in the Union Territory of Andaman & Nicobar Islands, India), should be considered a very significant
move in the direction of building a Global WordNet for several reasons.

It is common knowledge that every linguistic community has evolved its own knowledge system
through ages which gets reflected in the content and usage of its language. Human beings speak and 
communicate employing words and sentences (words should be considered more significant and 
fundamental in this context because words are the minimum meaningful utterances for communication - 
single worded responses to questions like ‘yes’; ‘no’ and imperatives like ‘come’ and ‘go’ can be cited as 
examples). Hence the entire lexicon (the lexicalized concepts) of a language as a whole can be
considered its content. The meaning ('word sense' in the context of a WordNet) and the semantic 
relationship between the words of a language can be considered its usage. It can be easily agreed that 
there are universal factors governing this semantic relationship and organization of words. But such 
factors play differently in different languages in different degrees. 

Traditional Dictionaries and the WordNet 

Unlike a traditional dictionary or lexicon, the WordNet attempts to describe the entire collection of words 
of a language as a single unit and thus it helps in defining the communicative pattern of a language. To 
study the communicative pattern of a language, no doubt, its grammar and morphology are very 
important. But, as the WordNet attempts to describe the entire collection of words as a single unit and 
because this process involves describing the semantic relationship between words and their hierarchical 
organization using word sense as the ultimate unit for communication, it can be said that the WordNet 
also plays an important role in revealing the communicative pattern of a language. In this process, all 
nouns are brought under one major category, all verbs are brought under another category and so on. The 
words in each category (each part of speech) are treated more as lexicalized concepts that are explained in 
terms of their semantic relationship with each other rather than individual words having discrete meaning 
definitions. And thus a holistic view of communication through language is projected. The way the 
lexicalized concepts of a language are organized into synsets and the way the synsets are placed in a 
hierarchy are crucial for the WordNet and it is in these respects that the WordNet differs from the 
traditional lexicons and thesauri (Miller, G.A. 1999). 

Nouns in WordNet 

In ‘Nouns in WordNet’ Miller has clearly explained the relationship between hyponyms and hypernyms
and the way their relationship is expressed in the WordNet. Whenever a synset representing a hyponym
is linked to another synset representing a hypernym, the semantic relationship is characterized as ‘IS-A’
or ‘IS-A-KIND-OF’ using the notation ‘@->’, and the inverse semantic relationship (that is, from a
hypernym to the hyponym) is characterized as ‘SUBSUMES’ using the notation ‘~->’. A hypernym may
again become a hyponym to a higher order hypernym and a hyponym may become a holonym for a
number of meronyms employing the whole-part semantic relationship. Building a hierarchy of lexical
concepts like this has the ultimate aim of bringing the entire volume of words or lexical concepts of a
language into a single hierarchy. Choosing the ‘UNIQUE BEGINNERS’ (Miller, G.A. 1999) is an
opening exercise in this direction.

Miller has listed twenty five unique beginners in ‘Nouns in WordNet’. According to Miller, WordNet 
divides nouns into several hierarchies each with a different unique beginner corresponding to a relatively 
distinct semantic field, each with its own vocabulary. He says, “a unique beginner corresponds roughly to 
a primitive semantic component in a compositional theory of lexical semantics.” This should, no doubt, 
be true of lesser known languages, especially tribal languages, as much as it is true of the modern literary
languages. But such primitive semantic components must be more unique qualitatively and hence
quantitatively in the lesser known tribal languages both among themselves and as compared to other
modern languages as a whole due to evolution (or development) of such languages in isolation and
uniqueness in culture and requirements for communication. If the primitive semantic components of a
language are found to be unique, the way its words are grouped into different categories and
subcategories must be unique too, resulting in different number of such categories as compared to other 
languages. 

Nouns in Car Nicobarese 

When we go through the unique beginners proposed by Miller, it rightly appears that the vocabulary of 
any language, whether classical or modern, and whether literary or preliterate (like the tribal languages)
can be easily accommodated in this framework. The point that I am trying to focus on in this paper is 
that, below this level of hierarchy, that is the unique beginners, the vocabulary of different languages will 
fall into different conceptual classes due to the peculiarities in the semantic relationship among groups of 
words and their categorization leading to differences in the hierarchy from language to language. I 
presume, the picture of the hierarchy down below the unique beginners must be very different in the tribal 
languages as compared to the modern literary languages as a whole. This is in spite of the harmony in
general as to what a bird or what an animal or what a plant is. Just as there is a hypernym ‘balan’, a Dyirbal
concept (provided by Lakoff) which includes women, fire, dangerous things and several other items, the
Car Nicobarese speakers classify the entire class of inanimate nouns (and a few animate nouns like body
parts) into two classes when they are referred to in the oblique either as /0/ or Id meaning ‘it
(accusative)’. (/0/ here is a rounded low mid back vowel). /0/ occurs with all animate nouns including
plants. Id occurs with inanimate nouns but not with all. The inanimate nouns are classified into two
groups. One group of nouns takes Id in the oblique whereas the other group of nouns takes /0/. A survey
of inanimate nouns in Car Nicobarese reveals a very interesting phenomenon. The inanimate nouns that
are considered by the Nicobarese speakers as ‘significant’ are referred to by the morph (a word as well)
/0/ instead of Id. Any inanimate noun that is considered ‘by the Nicobarese’ as a tool or device in
making something or getting something done is considered significant whereas a finished product is not
considered significant enough to take /0/. A hammer or a chisel or a coconut scraper would take /0/ whereas a
table or a chair or a cup would take only Id. Car Nicobarese words for tooth and finger (/kanap/ and
/kunti/ respectively) would take /0/ whereas the words for other body parts like eye and nose (/mat/ and
/elmeh/ respectively) would take only Id. In addition to this, a few culturally significant nouns
like /a:p/ ‘(native) canoe’ and objects used in the Nicobarese rituals take /0/ whereas a ship or a motor
boat takes Id. Images and pictures of human beings, gods and other supernatural bodies take /0/ but
images and pictures of animals, birds, plants and other inanimate objects take Id.





“So the hierarchical organization of nominal concepts appears to be a necessary feature of the mental 
dictionary”, says Miller. The details given in the previous paragraph reveal a small fragment of the 
mental dictionary of the Car Nicobarese speakers. All Car Nicobarese nouns should be separated into two 
classes depending upon the morph (meaning ‘it’) they take in the oblique. And this has to be done for
the inanimate nouns of the Nicobarese before most of the other unique beginners proposed by Miller can 
be applied and operated upon. 
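
The two-way split described above can be sketched as a simple rule. This is an illustration under our own assumptions, not a claim about Car Nicobarese grammar beyond the examples in the text: the class labels, the Noun record and its feature flags are hypothetical, the two oblique morphs are replaced by neutral labels, and images and pictures are left out.

```python
from dataclasses import dataclass

# Neutral labels standing in for the two oblique morphs (assumed names).
SIGNIFICANT = "significant class"
ORDINARY = "ordinary class"


@dataclass(frozen=True)
class Noun:
    gloss: str
    animate: bool = False                 # living beings and plants
    tool: bool = False                    # used to make or do something
    culturally_significant: bool = False  # e.g. native canoe, ritual objects


def oblique_class(noun):
    """Pick the oblique class from the features mentioned in the text."""
    if noun.animate:
        return SIGNIFICANT
    if noun.tool or noun.culturally_significant:
        return SIGNIFICANT
    # Finished products and body parts not regarded as tools fall here.
    return ORDINARY


examples = [
    Noun("hammer", tool=True),
    Noun("table"),
    Noun("tooth", tool=True),  # body part treated as a tool
    Noun("eye"),
    Noun("native canoe", culturally_significant=True),
    Noun("ship"),
]
for noun in examples:
    print(noun.gloss, "->", oblique_class(noun))
```

Running the loop places hammer, tooth and native canoe in the first class and table, eye and ship in the second, matching the examples given in the text.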


The primitive semantic components that govern the Car Nicobarese nouns as shown above can be
characterized as hypernyms forming Synset_t and Synset_c. Synset_t would stand for the semantic component
of ‘to be considered significant as a tool’ and Synset_c for the semantic component of ‘to be culturally
significant’. The following examples suggest how these lexicalized concepts in Car
Nicobarese could be represented in the hierarchy of WordNet at one level.

/mat/ ‘eye’        S_b @-> S_p @-> S_ns @-> S_a @-> S_o @-> S_e
/elmeh/ ‘nose’     S_b @-> S_p @-> S_ns @-> S_a @-> S_o @-> S_e

‘S’ is a ‘synset’; subscript ‘b’ is for the semantic component ‘body (part)’; subscript ‘p’ is for the
semantic component ‘person’; subscript ‘ns’ is for the semantic component ‘not significant as a tool and
not culturally significant’; ‘a’ for ‘animate’; ‘o’ for ‘object’ and ‘e’ for ‘entity’.

/kunti/ ‘finger’   S_b @-> S_p @-> S_t @-> S_a @-> S_o @-> S_e
/kanap/ ‘tooth’    S_b @-> S_p @-> S_t @-> S_a @-> S_o @-> S_e

The subscript ‘t’ stands for the semantic component ‘significant as a tool’.

/a:p/ ‘canoe’      S_v @-> S_f @-> S_c @-> S_i @-> S_o @-> S_e

The subscript ‘v’ stands for ‘a vehicle’; ‘f’ for ‘an artifact’; ‘i’ for ‘inanimate’ and ‘c’ for ‘culturally
significant’.

/co:ng/ ‘ship’     S_v @-> S_ip @-> S_ns @-> S_i @-> S_o @-> S_e

The subscript ‘ip’ stands for ‘industrial production’; ‘ns’ stands for ‘not significant as a tool and not
culturally significant’.


I do not claim that I have completely exhausted all possible groupings of the word senses in the above 
examples. The componential analysis of meaning attempted here could be still more elaborate leading to a 
longer chain of synsets but only a moderate attempt is made just to highlight the uniqueness of the Car 
Nicobarese language. 
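
The chains in the examples above can be held as ordered lists and compared. The sketch below is only illustrative: the dictionary CHAINS follows our reading of the six examples and their subscript legend, and the function names are hypothetical.

```python
# Component chains from most specific to most general, keyed by English gloss.
CHAINS = {
    "eye":    ["b", "p", "ns", "a", "o", "e"],
    "nose":   ["b", "p", "ns", "a", "o", "e"],
    "finger": ["b", "p", "t", "a", "o", "e"],
    "tooth":  ["b", "p", "t", "a", "o", "e"],
    "canoe":  ["v", "f", "c", "i", "o", "e"],
    "ship":   ["v", "ip", "ns", "i", "o", "e"],
}


def render(chain):
    """Write a chain in the '@->' notation used in the examples."""
    return " @-> ".join(f"S_{component}" for component in chain)


def first_shared(word_a, word_b):
    """Return the most specific component of word_a that word_b also has."""
    other = set(CHAINS[word_b])
    for component in CHAINS[word_a]:
        if component in other:
            return component
    return None


print(render(CHAINS["canoe"]))        # S_v @-> S_f @-> S_c @-> S_i @-> S_o @-> S_e
print(first_shared("eye", "ship"))    # ns
print(first_shared("eye", "canoe"))   # o
```

In this reading ‘eye’ and ‘ship’ meet at the component ‘ns’, whereas ‘eye’ and ‘canoe’ meet only at ‘o’ (object).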


In the hierarchy presented above, the semantic roles of hyponym-hypernym and meronym-holonym are to
be found and the Nicobarese primitive semantic components, namely, ‘be a significant object as a kind of 
tool’ and ‘be culturally significant’ are exposed. When we analyse the vocabulary of the Car Nicobarese 
language, we find more interesting phenomena like this each projecting a unique semantic component. 


Typicality and Uniqueness 


Car Nicobarese does not have a generic word for ‘man’ or ‘people’. /tarik/ is a native Car Nicobarese
man or woman, /ta-3:/ is a native Nicobarese man or woman from other islands and /ta-3iny/ is any
outsider. Thus the generic to specific and specific to generic semantic relationships of words are
strikingly unique in lesser known languages like Car Nicobarese. But it is to be agreed that the
‘typicality’ of word sense in the vocabulary of all languages has much in common. “Perhaps the
typicality and hierarchy coexist”, says Miller. But as I believe, both typicality and uniqueness contribute
to the hierarchy of lexicalized concepts in all human languages. Typicality must be promoting more 
commonness to the hierarchy in different languages whereas the uniqueness must be promoting variations 





qualitatively and quantitatively to the primitive semantic components which in turn should alter the type 
and number of unique beginners affecting the hierarchy. 

Verbs in Car Nicobarese 

Verbs in Car Nicobarese do show remarkable uniqueness. The semantic concept ‘climb up’ has different 
lexical verbs depending upon the objects that are climbed upon. 

/yO:ld/ ‘to climb up a tree, wall, a fence etc.’ 

/yOh/ ‘to climb up a ladder, steps etc.’ 

/y0kl5/ ‘to climb on to a table, platform, stool etc.’ 

/rdt/ ‘to climb up a coconut tree’ 

In the same way there are verbs matching to the above to denote the lexical concept ‘to climb down’. 
Here, the hyponym-hypernym semantic relationship is controlled by the type of objects with which the
action is associated and thus it affects the make up of primary semantic components for verbs in Car 
Nicobarese. 

Similarly, the verb ‘to mix up’ has two words in Car Nicobarese - /kalux/ and /kateh/. /kalux/ occurs 
when the two objects to be mixed are solid and solid; solid and liquid; rice and vegetables; vegetable 1 
and vegetable 2; rice and curries etc. /kateh/ occurs when the objects to be mixed are liquid and liquid; 
same object-quality 1 and quality 2; when two sexes of same animals or birds are mixed; when different 
breeds of same animals or birds are mixed etc. 
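
The choice between the two verbs for ‘to mix up’ can be sketched as a small rule. This is a simplification under our own assumptions: the Item record and its fields are hypothetical, ‘same object’ is modelled as two items with the same name, and water and milk are our own illustrative liquids rather than examples from the fieldwork.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str          # what the thing is, e.g. "rice"
    state: str         # "solid" or "liquid"
    variety: str = ""  # quality, sex or breed, where relevant


def mix_verb(a, b):
    """Choose the 'mix up' verb from the two things being mixed."""
    if a.name == b.name:
        # Same object in two qualities, sexes or breeds.
        return "kateh"
    if a.state == "liquid" and b.state == "liquid":
        return "kateh"
    # Solid with solid, or solid with liquid.
    return "kalux"


print(mix_verb(Item("rice", "solid"), Item("vegetables", "solid")))
print(mix_verb(Item("water", "liquid"), Item("milk", "liquid")))
print(mix_verb(Item("rice", "solid", "quality 1"), Item("rice", "solid", "quality 2")))
```

The same-name test comes first because two qualities of the same solid take /kateh/ even though both items are solid.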

Unique beginners for Verbs 

In ‘Semantic Network of English Verbs’(Fellbaum, C. 1999), we find a discussion on unique beginners 
for verbs but Fellbaum does not propose a set of unique beginners. She disagrees with Pulman (1983) 
who suggested just ‘be’ and ‘do’ which amounts to division between activity and stative verbs and she 
says “do in do my hair or do my room in blue clearly expresses semantically very specific and elaborate
concepts”. She further says “whereas a hierarchical link between communicate and chat seemed entirely
appropriate, the same link between do and communicate appeared less felicitous—the two concepts seem
farther removed from each other than do communicate and chat”. But it is not clear to this author why
‘be’ and ‘do’ cannot become unique beginners for verbs in any language. The examples she has quoted, 
namely, ‘doing somebody’s hair’ and ‘doing somebody’s room in blue’ definitely have significant 
distinction in word sense for the word ‘do’. But both expressions definitely refer to an EVENT as against 
STATE (Jackendoff, 1983). If ‘do’ can be a unique beginner, then, the word senses ‘doing somebody’s 
hair’ and ‘doing somebody’s room blue’ may appear as hyponyms under different hypernyms taking
different routes in the hierarchy and thus placed at different nodes in the network of English verbs. I 
agree that Fellbaum could have more important reasons for not accepting ‘be’ and ‘do’ as unique 
beginners (as she has worked extensively on networking of English verbs) but such reasons are not made 
explicit in her paper. Right now I believe that even in the lesser known tribal languages like the Car 
Nicobarese, the unique beginners for verbs could be a set of a small number of lexicalized semantic 
concepts and all the verbs of the language may be pulled together into a single hierarchical structure (as it 
is suggested in the case of nouns: Miller, G.A. 1998). 

Conclusions 

I have made a small attempt in this paper to explain the networking of nouns and verbs in the indigenous 
languages as compared to modern languages. I believe elaborate studies on these lines will not only
bring the unique knowledge systems of such indigenous languages to light but also will help in reviewing 





the theory of networking of languages in general and to discover more interesting primitive semantic 
components in particular. 

This paper has to be considered only as a proposal for a WordNet for lesser known languages like Car 
Nicobarese and this author believes that very intensive fieldworks will have to be conducted for building 
WordNets for languages like Car Nicobarese which do not have adequate written literature. 

ACKNOWLEDGEMENTS 

I am thankful to Professor Udaya Narayana Singh, Director, Central Institute of Indian Languages, 
Mysore and B.D. Jayaram, CIIL, Mysore for the constant encouragement and support I received from 
them while writing this paper. My sincere thanks are due to S. Rajendran, Department of Linguistics, 
Tamil University, Thanjavur and S. Arulmozi, AU-KBC Centre For Internet & Telecom Technologies, 
Anna University, Chennai for the initial discussions I had with them. I would specially like to thank Arlo 
Griffiths, University of Leiden, who went through the first version of this paper and offered valuable
suggestions, most of which have been incorporated in this revised version. I am very much indebted to
the critical review of the first version of this paper forwarded to me by the Secretary, Global WordNet
Conference. This paper has benefited from this review. Needless to say, all residual errors are
my own. I would like to thank the natives of Car Nicobar Island from whom I learnt the Car Nicobarese words.

REFERENCES 

Fellbaum C. (1998, 1999) Semantic Network of English Verbs. In “WORDNET-AN ELECTRONIC 
LEXICAL DATABASE”, Christiane Fellbaum, ed., The MIT Press, Cambridge, Massachusetts, London, 
England, pp. 69-104. 

Elangaiyan R. (Forthcoming) A Grammar of Car Nicobarese. 

Miller G.A. (1998,1999) Nouns in WordNet. In “WORDNET-AN ELECTRONIC LEXICAL 
DATABASE”, Christiane Fellbaum, ed., The MIT Press, Cambridge, Massachusetts, London, England, 
pp. 23-46 

Rajendran S. (2001) Preliminaries to the Preparation of a Word Net for Tamil Paper presented in the 
“National Seminar on Computational Linguistics and Dravidian Languages” held at the Center of 
Advanced Studies in Linguistics, Annamalai University, 22-24, Feb, 2001. 


R. Elangaiyan, Central Institute of Indian Languages, Manasagangotri, Mysore, India.






Creating A Bilingual Ontology: A Corpus-Based Approach for 
Aligning WordNet and HowNet 


Marine Carpuat 
Grace Ngai 
Pascale Fung 
Kenneth W. Church 


Abstract 

The growing importance of multilingual information retrieval and machine translation has made 
multilingual ontologies an extremely valuable resource. Since the construction of an ontology 
from scratch is a very expensive and time consuming undertaking, it is attractive to explore ways 
of automatically aligning monolingual ontologies which already exist. 

This paper presents a language-independent, corpus-based method that borrows from 
techniques used in information retrieval and machine translation, for creating a bilingual ontology
by aligning WordNet with an existing Chinese ontology called HowNet. We will present results 
to show that our method is capable of efficiently aligning ontologies with very different 
structures, as well as ontologies from languages that are very different from each other. 

1 Introduction 

The increasing popularity of the web and the growing inter-connectivity of the world have added
importance to the fields of machine translation and multilingual information retrieval. This has
increased the need for multilingual ontologies. The EuroWordNet (Vossen et al., 1997), which links
together languages such as Dutch, Italian, Spanish, German, French, Czech and Estonian with the 
American English WordNet (Miller et al., 1990), is an example of such a multilingual ontology. 

Since ontology building is a very time-consuming and costly process, the ideal method of building a 
multilingual ontology would be to align, or link, existing monolingual ontologies. 

Although such a task may, at first blush, seem far from challenging, there are many reasons why it
is less trivial than it seems. A manual alignment would take too long and be extremely expensive; 
while automatic alignment would run into many of the same problems faced by machine translation, 
among the more serious of them being the disambiguation problem. 


2 Motivation and Challenge 

In this paper, we propose a language independent, corpus-based method for automatically creating a 
bilingual ontology from two existing ontologies. Such bilingual ontologies are extremely useful for 
applications such as cross-lingual information retrieval and machine translation; however, current 
methods for construction of these ontologies rely on large amounts of human labour. There has been 
some work in ontology alignment (Palmer & Wu, 1999; Dorr et al., 2000); however, these methods
mainly utilize the structural information within the ontologies and therefore may not be
applicable to ontologies that have vastly differing structures. 

Our proposal is to create a bilingual Chinese-English ontology by linking the American English 
WordNet 1 and Simplified Chinese HowNet together by their most basic concepts: the WordNet synset 
and the HowNet definition. The languages involved, Chinese and English, are very different from 
each other; and these two ontologies are very different in their structures and design philosophies, 
making this an excellent test for the portability and robustness of our algorithm. 


1 WordNet 1.6 was used in all our experiments. 



The ready availability of bilingual dictionaries may make this seem like an easy task, but it is 
made far from trivial by the fact that: 

1. The structures of HowNet and WordNet are vastly different, as will be described later, making 
a surface structural alignment impossible; and 

2. Many polysemous words (words with more than one sense description) have multiple 
translations depending on the sense, making it necessary to tackle the problem of word sense 
disambiguation, a complicated problem. 

Our method borrows from techniques found in Information Retrieval and Statistical Machine 
Translation. The rest of this section will first describe the WordNet and HowNet structures, and go on 
to give a thorough description of the algorithm. 

2.1 Ontologies: WordNet and HowNet 

WordNet (Miller et al., 1990) is an electronic lexical database in which English nouns, verbs, 
adjectives and adverbs are arranged in synonym sets, with different relationships linking the different 
sets. 

Like WordNet, HowNet (Dong, 1988) is also an electronic lexical database for words, which are 
mostly in Chinese (some English technical terms such as “ASCII” are included). HowNet and 
WordNet, however, differ greatly in their structure, coverage and sense granularity. WordNet aims to 
differentiate word senses from each other through the use of synonym sets, or synsets. These synsets 
are constructed by gathering synonymous word senses into nodes, which are constructed so as to 
allow a user to easily distinguish between the different senses of a word. For example, given the noun 
“address”, the construction of the synsets {address, computer address} and {address, speech} should 
be enough to assist a user in distinguishing between the different senses of the word.² The synsets are 
then in turn linked to other synsets through hierarchical relations such as hypernyms, hyponyms, 
holonyms and meronyms. A total of 109,377 synsets are defined. 

The meaning representation in WordNet thus follows what is commonly referred to as a differential 
theory of lexical semantics (Miller et al., 1990). In contrast, HowNet takes a constructive 
approach to building a lexical hierarchy. At the most atomic level is a set of almost 1500 basic 
definitions, or sememes, such as “human” or “aValue” (attribute value). A total of 16,788 word 
concepts, or definitions, are composed of subsets of these sememes, sometimes accompanied by 
“pointers” that express certain kinds of relations. 

For example, in HowNet, the word for “scar” is associated with the definition (the English glosses 
are part of the original HowNet definitions): 

{ trace|痕迹, #disease|疾病, #wounded|受伤 } 
where “#” is a pointer that denotes a “relevance” relation. 

Unlike WordNet, synsets are not explicitly defined in HowNet. However, word synonyms can still 
be found by looking for words with identical definitions. For example, the words for “fanatic” 
and “enthusiast/hobbyist” are both associated with the definition: 

{ human|人, *FondOf|喜欢, #WhileAway|消闲 } 
where “*” denotes a relation of “agent or instrument of an event”. 

Relationships other than synonyms can also be extracted from the definitions. For example, the 
word for “hobby” has the definition: 

{ fact|事情, $FondOf|喜欢, #WhileAway|消闲 } 
where “$” denotes a relation of “patient, target, possession or content of an event”. 

A detailed WordNet-HowNet structural comparison can be found in Wong & Fung (2002). 
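
The synonym lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of HowNet's own tooling: the helper names `parse_definition` and `group_synonyms` are hypothetical, and the toy entries use English glosses and the English halves of the sememes as stand-ins for real HowNet records.

```python
from collections import defaultdict

# Pointer symbols and the relations they denote, as described in the text.
POINTERS = {
    "#": "relevance",
    "*": "agent or instrument of an event",
    "$": "patient, target, possession or content of an event",
}

def parse_definition(definition):
    """Split a definition string into a tuple of (pointer, sememe) pairs."""
    parsed = []
    for item in definition.strip("{} ").split(","):
        item = item.strip()
        if not item:
            continue
        pointer = item[0] if item[0] in POINTERS else ""
        parsed.append((pointer, item[len(pointer):]))
    return tuple(parsed)

def group_synonyms(entries):
    """Return groups of words whose definitions are identical."""
    groups = defaultdict(list)
    for word, definition in entries:
        groups[parse_definition(definition)].append(word)
    return [words for words in groups.values() if len(words) > 1]

# Toy entries (assumed for illustration only).
entries = [
    ("fanatic", "{ human, *FondOf, #WhileAway }"),
    ("enthusiast", "{ human, *FondOf, #WhileAway }"),
    ("hobby", "{ fact, $FondOf, #WhileAway }"),
]
print(group_synonyms(entries))
```

Definitions are compared as ordered tuples here, so two words count as synonyms only when their definitions match exactly, pointers included.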

2.2 Word Sense Disambiguation 

The biggest initial problem encountered was that of polysemous words and multiple sense 
translations. For example, the word “crane”, which has two sense definitions in WordNet, has at least 
two Chinese translations. The first of the two WordNet senses, “a piece of heavy machinery”, is 
usually translated as “起重机”, while the common translation for the second sense, “a large wading 
bird”, is “鹤”. 


2 Example taken from (Miller et al., 1990) 

To get a sense of the severity of this problem, we conducted a simple baseline experiment: 

1. Pick 2000 HowNet definitions at random and extract all the corresponding word entries. 

2. Translate each of these words to English. 

3. Associate each of these English words with its corresponding entry in WordNet. 


Average Number of HowNet Entries per Definition      5.4 
Average Number of WordNet Synsets per Definition     8.1 

Table 1: Results of Baseline Experiment 


Table 1 shows the results of the baseline experiment. The 2000 randomly chosen HowNet definitions 
each contain, on average, over 5 words, or entries. Furthermore, our simple alignment algorithm 
aligns each definition to an average of 8 WordNet senses. 
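
The baseline procedure can be pictured with the following sketch. It is a simplified illustration under stated assumptions: the function name `baseline_candidates`, the placeholder Chinese word identifiers and the synset labels are all hypothetical, and the two dictionaries stand in for a bilingual dictionary and a WordNet lookup.

```python
def baseline_candidates(entries, zh_to_en, en_to_synsets):
    """Collect every synset reachable from a definition's entries via translation."""
    candidates = set()
    for zh_word in entries:
        for en_word in zh_to_en.get(zh_word, ()):
            candidates.update(en_to_synsets.get(en_word, ()))
    return candidates

# Toy data (assumed): two entries of one definition about wading birds.
zh_to_en = {"zh_crane": ["crane"], "zh_stork": ["stork"]}
en_to_synsets = {
    "crane": ["crane.machine", "crane.bird"],
    "stork": ["stork.bird"],
}
found = baseline_candidates(["zh_crane", "zh_stork"], zh_to_en, en_to_synsets)
print(sorted(found))
```

Because no disambiguation takes place, the machinery sense of “crane” is collected alongside the bird senses, which is the kind of over-generation that Table 1 quantifies.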

To create a finer-grained mapping, we took a similar approach to the Definition Match Algorithm 
(Knight & Luk, 1994), which compares words according to their contexts from example sentences 
and definitions found in a dictionary. Our approach also uses word contexts, but instead of using 
dictionary definitions, we extract word contexts from a large bilingual corpus. 

2.3 An Information Retrieval Approach to Bilingual Dictionary Extraction 

In order to be able to compare words using their contexts, we need to use a method that will allow 
cross-lingual comparison of words and their contexts. 

Fung & Lo (1998) developed an information retrieval-like method designed to allow comparison of 
word contexts across languages, and across corpora that need not be parallel. Given two languages, 
say English and Chinese, the algorithm consists of the following steps: 

1. Define a list of t seed words in both languages. The set of seed words from one language is 
a direct translation from those in the other language. 

2. For each of the other words w in the corpus, construct a context vector v relative to the seed 
words from the corresponding language (e.g. for an English word, consider only English 
seed words, etc.), such that $v_i$, the i-th element of v, is defined as: 

$$v_i = TF_{i,w} \times IDF_i$$ 

where $TF_{i,w}$ = term frequency (number of occurrences) of seed word i in the context of w, 

$$IDF_i = -\log \frac{n_i}{\max(n)}$$ 

$n_i$ = total number of occurrences of seed word i in the corpus, and 
$\max(n)$ = maximum frequency of any seed word in the corpus. 


3. Given a pair of words from the two languages (for example, an English word $w_e$ and a 
Chinese word $w_c$), the similarity between them is the distance between their context 
vectors. We use the S3 measure from Fung & Lo (1998), which is a combination of the 
cosine similarity and the Dice coefficient: 

$$similarity(w_e, w_c) = \frac{\sum_{i=1}^{t} (w_{ic} \times w_{ie})}{\sum_{i=1}^{t} w_{ic}^2 \times \sum_{i=1}^{t} w_{ie}^2}$$ 

where 

$$w_{ic} = TF_{ic} \times IDF_i$$ 
$$w_{ie} = TF_{ie} \times IDF_i$$ 
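
The three steps above can be illustrated with a small sketch. Several details are assumptions made for the example rather than facts from the paper: the context of a word is taken to be the sentence it occurs in, IDF is taken as the negative log of a seed word's count relative to the most frequent seed word, and the similarity is taken to be the dot product divided by the product of the squared vector norms. The function names are hypothetical. For a cross-lingual comparison, the two seed lists would have to be index-aligned so that position i in both vectors refers to the same seed translation pair.

```python
import math

def idf_weights(seed_counts):
    """IDF_i = -log(n_i / max(n)) for each seed word that occurs in the corpus."""
    max_n = max(seed_counts.values())
    return {s: -math.log(n / max_n) for s, n in seed_counts.items() if n > 0}

def context_vector(word, sentences, seeds, idf):
    """v_i = TF_{i,w} * IDF_i, using the sentence as the (assumed) context."""
    tf = dict.fromkeys(seeds, 0)
    for sentence in sentences:
        if word in sentence:
            for token in sentence:
                if token in tf and token != word:
                    tf[token] += 1
    return [tf[s] * idf.get(s, 0.0) for s in seeds]

def similarity(u, v):
    """Assumed S3-style score: dot product over the product of squared norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sum(a * a for a in u) * sum(b * b for b in v)
    return dot / norm if norm else 0.0

# Toy monolingual corpus (assumed), already tokenised.
sentences = [["bank", "money", "loan"], ["bank", "river"], ["tree", "river"]]
seeds = ["money", "river"]
seed_counts = {s: sum(sent.count(s) for sent in sentences) for s in seeds}
idf = idf_weights(seed_counts)
bank_vector = context_vector("bank", sentences, seeds, idf)
print(bank_vector)
```

One consequence of this IDF definition is that the most frequent seed word always receives a weight of zero, so it contributes nothing to any context vector.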


2.4 Using Synsets for Word Sense Disambiguation 

The context vector similarity has been shown to be effective at extracting bilingual word translation 
pairs. What we are interested in, however, is the alignment of the proper translation pair to the 
correct word sense, assuming that we already have a list of candidate translations. Our method uses 
the similarity score and the synset information to aid in word sense disambiguation. 

1. Given a HowNet definition d, first extract its associated set of Chinese words and English 
translations. 

2. For each word from the English translations, find the WordNet synsets that it belongs to. 

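3. For each of these candidate WordNet synsets s, 

3.1. If s contains only a single word (|s| = 1), expand it by adding words from its direct hyperset. 

3.2. Define: 

$$similarity(d, s) = \frac{\sum_{w_c \in d} \sum_{w_e \in s} similarity(w_c, w_e)}{\sum_{w_c \in d} appears(w_c) \times \sum_{w_e \in s} appears(w_e)}$$ 

where 

$$appears(w) = \begin{cases} 1 & \text{if } n_w > 0 \\ 0 & \text{otherwise} \end{cases}$$ 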

The candidate WordNet synsets are then ranked according to their similarity with the Chinese 
HowNet definition. The alignment “winner” is defined as the highest-ranking WordNet synset. 

Step 3.1 deserves some more explanation. In the course of our investigation, we noticed that there 
were many WordNet synsets that contained only one word, which correspond to word senses for 
which a reasonable synonym does not exist. (WordNet uses a short gloss to help the user to 
distinguish between such senses.) For example, the noun “support” has 11 senses defined in WordNet, 
of which 5 are single-word synsets. The meanings of these synsets range from “aid” to 
“supporting structure”. If our method relied solely on words from the synset, we would be unable to 
distinguish between these senses, which would be a major problem. Therefore, words from 
the hyperset (the set of hypernyms of the current word) are included to aid in defining the meaning. 
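
The ranking procedure, including the hyperset expansion of step 3.1, might be sketched as follows. This is an illustration under assumptions: the definition-to-synset score is taken to be the summed word-pair similarity divided by the number of definition words and synset words that actually appear in the corpus, the function names are hypothetical, and the word similarities and corpus counts are toy values.

```python
def appears(word, counts):
    """1 if the word occurs in the corpus, 0 otherwise."""
    return 1 if counts.get(word, 0) > 0 else 0

def expand_synset(words, hypernym_words):
    """Step 3.1: add hypernym words only when the synset has a single word."""
    words = list(words)
    return words + list(hypernym_words) if len(words) == 1 else words

def definition_synset_similarity(def_words, synset_words, word_sim, zh_counts, en_counts):
    """Summed word similarity, normalised by the words seen in the corpus."""
    total = sum(word_sim(wc, we) for wc in def_words for we in synset_words)
    seen = (sum(appears(w, zh_counts) for w in def_words)
            * sum(appears(w, en_counts) for w in synset_words))
    return total / seen if seen else 0.0

def rank_synsets(def_words, candidates, word_sim, zh_counts, en_counts):
    """Return (score, synset name) pairs, best candidate first."""
    scored = []
    for name, (words, hypernym_words) in candidates.items():
        expanded = expand_synset(words, hypernym_words)
        score = definition_synset_similarity(
            def_words, expanded, word_sim, zh_counts, en_counts)
        scored.append((score, name))
    return sorted(scored, reverse=True)

# Toy values (assumed), with a placeholder identifier for the Chinese entry.
sims = {("zh_break", "pause"): 0.4, ("zh_break", "interruption"): 0.2,
        ("zh_break", "fracture"): 0.1}
word_sim = lambda wc, we: sims.get((wc, we), 0.0)
zh_counts = {"zh_break": 12}
en_counts = {"pause": 30, "interruption": 8, "fracture": 5, "break": 0}
candidates = {
    "break.pause": (["pause", "interruption", "break"], []),
    "break.fault": (["fracture", "break"], []),
}
ranking = rank_synsets(["zh_break"], candidates, word_sim, zh_counts, en_counts)
print(ranking)
```

Normalising by the words that appear in the corpus keeps a synset from being penalised for members that were never observed and therefore could not contribute any similarity.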




3 Experiments 

The bilingual data in our experiments was taken from the English-Chinese Hong Kong News corpus, 
which comprises almost 18,500 aligned article pairs (totalling over 6 million words on the English 
side) from news documents released between 1997 and 2000. On the Chinese side, the corpus was 
word segmented with a simple greedy maximum-forward-match algorithm, using the entire HowNet 
vocabulary as a lexicon. The seed word list for the context vector construction was extracted by 
taking the monosemous words from WordNet and discarding all those that had more than one 
translation in Chinese. 
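
A greedy maximum-forward-match segmenter of the kind mentioned above can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation: the function name `forward_max_match` is hypothetical, the three-word lexicon is a toy stand-in for the HowNet vocabulary, and unknown characters are assumed to be emitted as single-character tokens.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Scan left to right, always taking the longest lexicon word that fits."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

lexicon = {"香港", "新闻", "报道"}
print(forward_max_match("香港新闻报道", lexicon))
```

The approach is fast and needs nothing but a word list, but because it never backtracks, an early long match can force a poor segmentation of the characters that follow.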

It must be noted that even though the Hong Kong News corpus is roughly parallel, our method does 
not restrict us to parallel corpora. Indeed, any large bilingual corpus will work for our method. 


3.1 Overall Results 

Table 2 shows the highest scoring alignments. For each HowNet definition, the highest scoring 
WordNet synset that was aligned to it and the corresponding alignment score are shown. With a few 
exceptions, our method succeeds in producing a reasonable alignment of WordNet synsets to 
HowNet definitions. 

The “BeNot|非” and “BeGood|良态” definitions deserve some further explanation. It is difficult to 
understand how “BeNot” aligns to a WordNet synset of {name, identify}. The HowNet entries that 
carry this definition, however, mostly have a rough meaning of “misrecognize”, or “mistaken 
identity”, hence the bizarre-seeming WordNet alignment result. Likewise, the entries for “BeGood” 
generally carry the meaning of being “in a good or desirable state”, which leads to “BeGood” aligning 
to the synset {state}. 





HowNet Definition                             Top Aligned WordNet Synset(s)                              Score 

human|人, #occupation|职位, employee|员        {employee, worker}                                         0.002456 
BeNot|非                                      {name, identify}                                           0.002311 
human|人, unable|庸, undesired|莠              {master, original}                                         0.0007193 
BeRecovered|复原, StateIni=alive|活着          {revive}                                                   0.0004365 
image|图像, $carve|雕刻                        {sculpture}                                                0.0003106 
AlterForm|变形状                              {top, pinch}                                               0.0001777 
aValue|属性值, rank|等级, elementary|初        {elementary, primary}                                      0.0001083 
AimAt|定向                                    {calculate, aim, direct}                                   [illegible] 
attribute|属性, pattern|样式, physical|物质    {form, word form}                                          [illegible] 
break|折断                                    {break}                                                    4.624x10^-5 
pay|付, possession=money|货币                  {pay}                                                      [illegible] 
BeGood|良态                                   {state}                                                    [illegible] 
BeOpposite|对立                               {confront}                                                 [illegible] 
donate|捐, possession=money|货币               {subscription}                                             [illegible] 
HoldWithHand|搀扶                             {pass, hand, reach, pass on, turn over, give}              4.9565x10^-6 
AmountTo|总计, means=CauseToBe|使之是          {convert, changeover}                                      [illegible] 
time|时间, @rest|休息, education|教育          {break, pause, interruption}                               [illegible] 
aValue|属性值, form|形状, even|平              {even}                                                     [illegible] 
BeBad|衰变                                    {die, decease, perish, go, exit, pass away, expire}        1.792x10^-7 
AlterLocation|变空间位置                      {exchange, change, interchange}                            [illegible] 

Table 2: Top Ranking Alignments of HowNet definitions to WordNet Synsets. (Words enclosed in curly 
braces belong to the same synset) 

3.2 Individual Examples 


The previous subsection gave an overview of the results. In order to have a better idea of the method, 
we will consider individual examples in this subsection. 

3.2.1 Example 1: Sculpture, or School Principal? 


We first consider the HowNet definition 

{ image|图像, $carve|雕刻 } 

which includes words glossed as carving, bust, sculpture, statue, head, relief sculpture and clay 
sculpture, among others. Using the words from the English translations directly, we have a mapping 
to 77 WordNet synsets. 


WordNet Synset                                                          Similarity with HowNet definition 

{sculpture} (sense 1 of sculpture)                                      0.0003106 
{principal, school principal, head teacher, head} (sense 13 of head)    0.0003091 
{mud, clay} (sense 2 of clay)                                           0.0003072 


The WordNet synset that best fits the HowNet definition is “sculpture”, which is appropriate given 
that all the various objects defined by the HowNet definition are various types of sculptures. 

3.2.2 Example 2: A Shape, or the Flow of Traffic? 

As another example of the difficulty of our task, and to demonstrate the success of our method in 
distinguishing between different word senses, we consider another HowNet definition: 

{ attribute|属性, pattern|样式, physical|物质 } 

which includes words glossed as “pattern, form” and “model, style”. The English 
words map to a total of 38 senses in WordNet. 


WordNet Synset                                              Similarity 

{form, shape, pattern} (sense 3 of form)                    [illegible] 
{form} (sense 8 of form)                                    3.725x10^-5 
  hypernyms: document, written document, papers 
{traffic pattern, approach pattern} (sense 7 of pattern)    [illegible] 


3.2.3 Example 3: An Interruption, or a Fracture? 

Another good example of the strength of our algorithm is the HowNet definition 

{ time|时间, @rest|休息, education|教育 } 

This definition has only one entry, glossed as “break”. 


WordNet Synset                                                               Similarity 

{pause, intermission, break, interruption, suspension} (sense 7 of break)    1.43x10^-7 
{fault, geological fault, fault line, fracture, break} (sense 2 of break)    1.03x10^-7 


Here, our method correctly ranks the WordNet sense corresponding to {pause, intermission, break, etc.} 
ahead of the synset that corresponds to geological faults. 

3.2.4 A Problematic Example: AlterForm|变形状 and “break” 

The HowNet definition {AlterForm|变形状} and the WordNet synset {break} provide challenging 
cases for our method. The words in the HowNet definition are very diverse and include entries 
glossed as “congeal”, “curdle/agglomerate” and “break into pieces”. The word “break” in WordNet, 
on the other hand, is very finely divided, with 63 senses in total, 34 of them being single-word 
synsets. Several senses of “break”, with the following partial ranking, were among the possible 
candidates: 


WordNet Synset                                      Similarity 

{break} (sense 25 of break)                         4.6234x10^-5 
  no hypernyms 
[illegible] (sense 45 of break)                     4.522x10^-5 
{break} (sense 31 of break)                         4.468x10^-5 
  hypernyms: cancel, call off 
{break} (sense 60 of break)                         4.439x10^-5 
  hypernyms: decrease, diminish, lessen, fall 
{bankrupt, ruin, break} (sense 35 of break)         4.435x10^-5 


Sense 25 of “break” carries the meaning of “cause to give up” (as in “to break oneself of a bad 
habit”). A single-word synset without any hypernyms, it creates a difficult case to handle due to the 
lack of other semantic clues, and indeed, our method wrongly gives it the top rank. It is encouraging, 
however, that sense 45 of break, which is one of the most semantically similar senses, is given the 
second-highest relative ranking; and senses 31 and 60, which are also single-word synsets, are given 
reasonable relative rankings, which would not have been possible without hyperset generalization. 

3.2.5 WordNet to HowNet mappings 

In addition to the above examples, the reverse mapping of WordNet synsets to HowNet definitions 
can also demonstrate the capabilities of our method. As an example, we consider the word “board”, 
which is divided into 9 WordNet senses, ranging from “plank” to “dining table” to “committee”. 

When translated to Chinese, the synonyms for “board” in its various senses produce a total of 537 
HowNet entries, distributed among 39 HowNet definitions. Table 3 shows, for each WordNet synset, 
the highest-ranking HowNet definition that was aligned to it. (The WordNet gloss is included for 
synsets that contain only one word for the reader’s convenience.) 













     WordNet Senses for “board”                                         HowNet Definition 

1    board (a committee having supervisory powers)                      display|展示 
2    board (a flat piece of material designed for a special purpose)    control|控制 
3    board, plank                                                       shape|物形, flat|扁, surfacial|面 
4    display panel, display board, board                                display|展示 
                                                                        show|表现 
5    board, gameboard                                                   display|展示 
                                                                        show|表现 
6    board, table                                                       shape|物形, flat|扁, surfacial|面 
                                                                        wood|木 
7    control panel, instrument panel, control board                     control|控制 
8    circuit board, circuit card                                        part|部件, %implement|器具 
9    dining table, board                                                shape|物形, flat|扁, surfacial|面 

Table 3: HowNet Definitions for various senses of the word "board" 


In general our method does find reasonable HowNet definitions for each synset. For example, it 
correctly identifies “circuit board” as an “implemented part”. Where it does not find the best 
alignment, for example, for the two senses of “table” and “dining table”, it does give reasonable 
descriptions: a “table” is reasonably described as having a shape that is “flat” and “surfacial”, and 
furthermore is commonly made of “wood”. 

An important thing to point out about this example is the problem caused by data scarcity. For the 
first sense of the word “board”, the correct HowNet definition is {institution|机构, *manage|管理, 
commercial|商业}, which contains only one entry: the word for “board of directors”, or simply 
“board”. This word did occur in the Hong Kong Daily News corpus used in our experiments, but due 
to the sparseness of the seed words in our corpus, its resulting similarity scores with the WordNet 
synsets were negligible. 

4 Analysis 

The examples presented in the previous section clearly show that in many cases, our method succeeds 
in finding a correct ordering of the WordNet senses that are candidates for alignment to a HowNet 
definition. However, there are some problematic situations that need to be addressed before a full 
HowNet-WordNet mapping can be achieved: 

■ One significant problem is the difference in sense granularity between WordNet and HowNet. For 
example, the HowNet definition {livestock|牲畜} includes entries as diverse as “dog”, “cattle” 
and “rabbit”, while WordNet allocates these three words to different synsets. Therefore, a 
1-to-1 mapping from all HowNet definitions to WordNet synsets does not exist. 

■ The seed word coverage that we are achieving with the HK Daily News corpus is also a matter of 
concern. As our seed words, we picked monosemous words in WordNet which had only one 
Chinese translation in our dictionary. This resulted in a list of very precise translations, but 
unfortunately, the words thus extracted also tend to be rare words, resulting in context vectors that 
had a lot of blank fields. To address this concern, and also to test the robustness of our approach, 
we plan to further experiment on comparable, rather than parallel, corpora. Since comparable 
corpora tend to be more plentiful than parallel corpora, we hope that the introduction of extra noise 
will be more than compensated for by the larger amount of available data. 

■ Another major problem encountered was that of non-compositional compounds, or NCCs. NCCs 
are word phrases whose meanings are “a matter of convention and cannot be synthesized from the 
meanings of their space-delimited components” (Melamed, 1997). Examples include phrases such 
as “floppy disk” and “hot dog”. At present, our method considers only single words and 
does not take such compounds into account, though it is intuitive that these compounds should not 
be broken up. We are considering using techniques borrowed from Chinese word segmentation, 
together with a large dictionary, to join the words in these compounds before running our method on them. 















■ Other concerns stem from the IR-like technique used to tackle this problem. Measures such as TF 
and IDF do not take into account the syntactic function of a word, causing noun and verb forms of 
the same word to have the same weight, which is likely suboptimal. Furthermore, techniques such as 
stemming should also be included in the method, which would likely capture the way a word is used 
more accurately. 
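
The seed word selection rule discussed in the list above, and the sparsity it leads to, can be made concrete with a short sketch. The names `select_seed_pairs` and `coverage` are hypothetical, as are the toy sense counts and the placeholder identifiers used for the Chinese translations.

```python
def select_seed_pairs(sense_counts, translations):
    """Keep words with exactly one sense and exactly one translation."""
    pairs = []
    for word, n_senses in sorted(sense_counts.items()):
        targets = translations.get(word, [])
        if n_senses == 1 and len(targets) == 1:
            pairs.append((word, targets[0]))
    return pairs

def coverage(vector):
    """Fraction of non-zero fields in a context vector."""
    return sum(1 for x in vector if x != 0) / len(vector) if vector else 0.0

# Toy data (assumed): number of senses per English word and its translations.
sense_counts = {"crane": 2, "typhoon": 1, "board": 9, "vaccine": 1}
translations = {
    "crane": ["zh_crane_a", "zh_crane_b"],
    "typhoon": ["zh_typhoon"],
    "vaccine": ["zh_vaccine_a", "zh_vaccine_b"],
}
seed_pairs = select_seed_pairs(sense_counts, translations)
print(seed_pairs, coverage([0.0, 0.0, 0.7, 0.0]))
```

A filter this strict favours precision over frequency, which is consistent with the observation that the surviving seed words are rare and leave many vector fields empty.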

Even though these are real concerns that should be addressed, none of them appear to be due to a 
fundamental flaw in our algorithm, and some of them (for example, the corpus problem and the 
stemming issue) are easily surmounted. Our method still succeeds in the partial aligning of two 
ontologies, which were constructed using very different approaches and philosophies and therefore 
differ vastly in their structure. 

5 Related Work 

The previous work that directly led to our research has been described in the earlier sections. This 
section will describe related work that targets the same problem. 

There has been some interest in aligning ontologies. Dorr et al. (2000) and Palmer & Wu (1995) 
took a structural approach to this problem. They focused on HowNet verbs and used thematic-role 
information, which denotes the contexts in which a particular verb may occur. The HowNet 
thematic-role specifications are mapped to word classes in an existing classification of English verbs 
called EVCA (Levin, 1993), whose structure is similar to that of the verb classes in HowNet. These 
mappings are then used to align English EVCA verbs to Chinese HowNet verbs. In Japanese, 
Asanoma (2001) used structural link information to align nouns from WordNet to a pre-existing 
Japanese ontology called Goi-Taikei via the Japanese WordNet (Hayashi 1999), which was 
constructed by manually translating a subset of WordNet nouns. 

There has also been a lot of interest in automatic bilingual word alignment and dictionary induction. 
The IBM Candide project (Brown et al., 1990) used statistical data to align words in sentence pairs 
from parallel corpora in an unsupervised fashion through the EM algorithm. Church (1993) used 
character frequencies to align words in a parallel corpus, and Fung & Church (1994) used seed words 
to align parallel texts and extract bilingual word translation pairs. 

Word sense disambiguation is also a difficult and much-studied subject, and many approaches have 
been tried on the problem. Among the numerous efforts are those of Gale, Church & Yarowsky 
(1992) and Schuetze (1992), who applied vector space models and similarity measures to the problem; 
Yarowsky (1995), who used decision lists; and Kikui (1999), who used words in non-parallel bilingual 
corpora to resolve translation ambiguity in an unsupervised fashion. 

6 Conclusions and Future Work 

In this paper, we present a language-independent, corpus-based method to align definitions from the 
Chinese HowNet with synsets from the English WordNet. Unlike many other ontology alignment 
techniques, our method does not take into account the structure of the ontology. Our method uses 
techniques borrowed from machine translation and information retrieval to calculate similarity scores 
between HowNet definitions and WordNet synsets; the alignment then relies on these scores to find 
the best alignment between the ontologies. Since our method does not make any 
assumptions about the structure of the ontology, or use any but the most basic structural information, 
it makes it possible to align ontologies with vastly different structures. 

We show that our method is promising in its ability to produce a reasonably good mapping from 
HowNet definitions to WordNet synsets. There are some problematic situations that prevent us from 
achieving a full mapping at this point, but they do not appear to be serious or insurmountable. 

We plan to expand on this work in the future by addressing the concerns raised in the analysis 
section and producing a full alignment from HowNet to WordNet. We also intend to expand our 
algorithm to possibly integrate more structural information. Finally, our goal is to examine the use of 
the aligned ontology in applications such as cross-lingual information retrieval and machine 
translation. 





Acknowledgements 

The authors would like to thank researchers at Weniwen Technologies: Ping-Wai Wong for his 
help and explanations on HowNet construction and structure; and Chi-Shun Cheung and Chi-Yuen 
Ma for their assistance in corpora and dictionary preparation. 

References 

N. Asanoma. Alignment of Ontologies: WordNet and Goi-Taikei. In Workshop on WordNet and Other Lexical 
Resources: Applications, Extensions and Customization. Pittsburgh, Pennsylvania, 2001. 

P.F. Brown, J. Cocke, S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer and P. Roossin. 
A Statistical Approach to Machine Translation. Computational Linguistics, 16:79-85, 1990. 

K. Church. Char_align: A Program for Aligning Parallel Texts at the Character Level. In Proceedings of the 
31st Annual Conference of the Association for Computational Linguistics, pp 1-8, Columbus, Ohio, 1993. 

Z. Dong. Knowledge Description: What, How and Who? In Proceedings of International Symposium on 
Electronic Dictionary. Tokyo, Japan, 1988. 

B. Dorr, G.A. Levow, D. Lin and S. Thomas. Large Scale Construction of Chinese-English Semantic 
Hierarchy. Technical Report LAMP TR 040, UMIACS TR 2000-17, CS TR 4120, University of Maryland, 
College Park, MD, 2000. 

P. Fung and K. Church. Kvec: A New Approach for Aligning Parallel Texts. In Proceedings of COLING 1994, 
pp 1096-1102, Kyoto, Japan, August 1994. 

P. Fung and Y.Y. Lo. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In 
Proceedings of the 36th Annual Conference of the Association for Computational Linguistics, pp 414-420, 
Montreal, Canada, 1998. 

W. Gale, K. Church and D. Yarowsky. A Method for Disambiguating Word Senses in a Large Corpus. In 
Computers and the Humanities, 26:415-439, 1992. 

Y. Hayashi. Translating WordNet Noun Part into Japanese for Cross-Language Natural Language Applications. 
Technical Reports of SIG on Natural Language Processing NL 130-10, pp 73-80, 1999. 

G. Kikui. Resolving Translation Ambiguity using Nonparallel Bilingual Corpora. In Workshop on 
Unsupervised Learning in Natural Language Processing, College Park, MD, 1999. 

K. Knight and S. Luk. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of 
AAAI '94, pp 773-778, Seattle, WA, 1994. 

B. Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, 
Chicago, IL, 1993. 

I.D. Melamed. Automatic Discovery of Non-Compositional Compounds in Parallel Data. In Proceedings of 
the 2nd Conference on Empirical Methods in Natural Language Processing. Providence, RI, 1997. 

G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller. WordNet: An On-line Lexical Database. In 
International Journal of Lexicography, 3(4):235-244, 1990. 

M. Palmer and Z. Wu. Verb Semantics for English-Chinese Translation. Machine Translation, 10(1-2):59-92, 
1995. 

H. Schuetze. Dimensions of Meaning. In Proceedings of Supercomputing '92, pp 787-796, Los Alamitos, 
CA. IEEE Computer Society Press, 1992. 

P. Vossen, P. Diez-Orzas, W. Peters. The Multilingual Design of EuroWordNet. In Proceedings of the 
ACL/EACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for 
NLP Applications. Madrid, Spain, July 1997. 

P.W. Wong and P. Fung. Nouns in HowNet and WordNet: An Analysis of Semantic Relations. In Proceedings 
of the 1st International Conference on Global WordNet. Mysore, India, 2002. 

D. Yarowsky. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 
33rd Annual Meeting of the Association for Computational Linguistics, pp 189-196, 1995. 


Marine Carpuat, Human Language Technology Centre, HKUST, Clear Water Bay, Hong Kong, 
eemanne@ust.hk 

Grace Ngai, Weniwen Technologies, Clear Water Bay, Hong Kong, grace@weniwen.com 

Pascale Fung, Human Language Technology Centre, HKUST, Clear Water Bay, Hong Kong 

Kenneth W. Church, AT&T Shannon Laboratory, Florham Park, NJ 07932, USA, kwc@research.att.com 





MultiWordNet: Developing an Aligned Multilingual Database 

Emanuele Pianta 
Luisa Bentivogli 
Christian Girardi 


Abstract 

This paper illustrates the MultiWordNet project, aimed at producing an Italian WordNet strongly 
aligned with the Princeton WordNet. The main conceptual differences between the MultiWordNet and 
the EuroWordNet conceptual models are presented first. Then two automatic procedures capable of 
speeding up the work of lexicographers are described. Finally, we give some details about the adopted 
data model and we present a graphical user interface that can be used to browse and update the 
aligned database. 

1. Introduction 


MultiWordNet is an on-going project at ITC-irst aiming at producing an Italian WordNet strictly 
aligned with Princeton WordNet (PWN); see Fellbaum (1998). In its first version, it contains around 
37,000 Italian words organized into some 28,000 synsets, along with information about the 
correspondence between Italian and English (PWN) synsets. MultiWordNet adopts a methodological 
framework distinct from EuroWordNet. 

There are at least two models for building a multilingual wordnet. The first model, adopted within the 
EuroWordNet project, consists of building language specific wordnets independently from each other, 
trying in a second phase to find correspondences between them (Vossen, 1998). The second model, 
adopted within MultiWordNet (MWN), consists of building language specific wordnets keeping as 
much as possible of the semantic relations available in the Princeton WordNet (PWN). This is done 
by building the new synsets in correspondence with the PWN synsets, whenever possible, and 
importing semantic relations from the corresponding English synsets; i.e., we assume that if there are 
two synsets in PWN and a relation holding between them, the same relation holds between the 
corresponding synsets in the new language. 
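
The relation-import assumption just described can be illustrated with a small sketch. The function name `import_relations`, the synset identifiers and the alignment table are hypothetical and serve only to show the idea: a relation is copied only when both of its PWN endpoints have a corresponding synset in the new language.

```python
def import_relations(pwn_relations, alignment):
    """Project PWN relations onto the aligned synsets of the new wordnet."""
    imported = set()
    for source, relation, target in pwn_relations:
        if source in alignment and target in alignment:
            imported.add((alignment[source], relation, alignment[target]))
    return imported

# Toy data (assumed): two PWN hypernym links and a partial alignment.
pwn_relations = [
    ("tree.n", "hypernym", "plant.n"),
    ("oak.n", "hypernym", "tree.n"),
]
alignment = {"tree.n": "albero.n", "plant.n": "pianta.n"}
italian_relations = import_relations(pwn_relations, alignment)
print(italian_relations)
```

Relations whose endpoints are not yet aligned are simply skipped here, so the imported network grows as more corresponding synsets are built.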

According to Vossen (1996), the MWN model (or “expand model” in his words) seems less complex 
and guarantees the highest degree of compatibility across different wordnets. To see this, consider that 
building any wordnet necessarily implies a large number of subjective (and questionable) decisions. 
Thus, if two wordnets are built independently for two different languages, they will exhibit 
differences which depend only partially on divergences between the languages. Some non trivial 
structural discrepancies will in fact depend on subjective decisions or different building criteria. The 
MWN model minimizes these discrepancies by strictly adhering to the PWN building criteria and 
subjective choices. 

The MWN model also has potential drawbacks. The most serious risk is that of forcing “an excessive 
dependency on the lexical and conceptual structure of one of the languages involved”, as Vossen 
(1996) points out. This risk can be considerably reduced by allowing the new wordnet to diverge, 
when necessary, from the PWN. Note that at least one wordnet within the EuroWordNet project, 
Spanish WordNet, is built following the “expand model” (Atserias et al., 1997). 

2. Automatic Procedures in the Construction of MultiWordNet 

Another important advantage of the MWN model is that automatic procedures can be devised to speed 
up both the construction of corresponding synsets and the detection of divergences between PWN and 
the wordnet being built. In all these procedures PWN itself can be used as a useful resource. 



The construction of Italian WordNet, which is the first instantiation of the MWN model so far, is 
crucially based on two automatic procedures. The first is called the Assign-procedure. Given an 
Italian word sense, the procedure selects a weighted list of the most likely corresponding PWN 
synsets. This list is then used by lexicographers to actually build the Italian synsets. The second 
procedure supports the detection of lexical gaps (LG-procedure), which are cases when a lexical 
concept of a language is expressed through a free combination of words in another language (see 
below, Sect. 2.2). 

Both these procedures use, as a crucial linguistic resource, the electronic version of the Collins 
bilingual dictionary. The bilingual Collins is a medium size dictionary, including 40,959 headwords 
and 60,901 translation groups in the English section, and 32,602 headwords and 46,545 translation 
groups in the Italian section. By translation group (TGR) we mean a group of translation equivalents 
(TEs) translating one of the senses of a source language word. In bilingual dictionaries, TGRs are 
usually separated by semicolons. We take them as the relevant sense unit as they correspond to 
WordNet senses. In the following example wood has 5 TGRs as a noun, and 2 TGRs as an adjective: 

wood [wUd] 1. n a. (material) legno; (timber) legname (m) b. (forest) bosco c. (Golf) mazza di legno; 
(Bowls) boccia 2. adj a. (made of wood) di legno b. (living etc. in a wood) di bosco, silvestre. 

In this example only one TGR includes more than one TE, namely the second TGR of wood as an 
adjective (2.b), which can be translated by either di bosco or silvestre. Note that a TE can be a simple 
word (bosco) or a phrase (di legno), and that each TGR is introduced by a gloss illustrating the sense 
of wood which is being translated. 
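One possible in-memory representation of the wood entry is sketched below. The class and field names are assumptions made for illustration and do not reflect the actual format of the electronic Collins dictionary.

```python
from dataclasses import dataclass


@dataclass
class TranslationGroup:
    """One TGR: a part of speech, the introducing gloss, and its TEs."""
    pos: str
    gloss: str
    equivalents: list


# The entry for "wood" as printed above, one object per TGR.
wood = [
    TranslationGroup("n", "material", ["legno"]),
    TranslationGroup("n", "timber", ["legname"]),
    TranslationGroup("n", "forest", ["bosco"]),
    TranslationGroup("n", "Golf", ["mazza di legno"]),
    TranslationGroup("n", "Bowls", ["boccia"]),
    TranslationGroup("adj", "made of wood", ["di legno"]),
    TranslationGroup("adj", "living etc. in a wood", ["di bosco", "silvestre"]),
]

noun_tgrs = [tgr for tgr in wood if tgr.pos == "n"]
adjective_tgrs = [tgr for tgr in wood if tgr.pos == "adj"]
multi_te_tgrs = [tgr for tgr in wood if len(tgr.equivalents) > 1]
```

Counting the objects reproduces the figures given in the text: 5 noun TGRs, 2 adjective TGRs, and a single TGR with more than one TE.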

2.1. The Assign-Procedure 

Following the MWN model, our aim is to build, whenever possible, Italian synsets which are 
synonymous (semantically correspondent) with the PWN synsets. If this is not possible, then we have 
found an English-to-Italian or an Italian-to-English lexical idiosyncrasy. 

Italian synonymous synsets can be built following different strategies. The first strategy is based on 
English-to-Italian TEs. For each PWN synset S, we look for the Italian TEs which are cross-linguistic 
synonyms of the English words of S. The union of such TEs is the Italian synonymous synset of S. If 
we cannot build any Italian synonymous synset for S, we have found an English-to-Italian lexical 
idiosyncrasy. 

The second strategy is based on Italian-to-English TEs. For each sense of an Italian word I, we look 
for a PWN synset S including at least one English TE of I and we establish a link between I and S. 
When the procedure has been applied to all Italian word senses, we can build the equivalence class of 
all sets of Italian words which have been linked to the same PWN synset. Each set in the equivalence 
class is the Italian synset synonymous with some PWN synset. If, for a set of Italian synonyms there 
is no PWN synonymous synset, then we have found an Italian-to-English lexical idiosyncrasy. 
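The grouping step of the Italian-to-English strategy can be sketched as follows. This is only an illustration under simplifying assumptions: links are given as (Italian word, PWN synset identifier) pairs, and both the identifiers and the sample links are invented.

```python
from collections import defaultdict


def build_italian_synsets(links):
    """Group Italian words by the PWN synset they have been linked to."""
    synsets = defaultdict(set)
    for italian_word, pwn_synset_id in links:
        synsets[pwn_synset_id].add(italian_word)
    return dict(synsets)


# Invented links, as they might look after lexicographers confirmed them.
links = [
    ("bosco", "forest.n"),
    ("foresta", "forest.n"),
    ("legno", "wood.n"),
]
italian_synsets = build_italian_synsets(links)
```

Each value of the resulting dictionary is the Italian synset synonymous with the PWN synset used as key. Italian word senses that end up with no link at all would be the candidates for Italian-to-English lexical idiosyncrasies.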

The best alignment between Italian and Princeton WordNet can most likely be achieved by using both 
strategies and trying to cross-validate their results. As a matter of fact, so far we only exploited the 
Italian-to-English strategy. The same holds for Atserias et al. (1997). 

Finding links between Italian word senses and PWN synsets is a complex and time consuming task, 
even if less complex and much quicker than building Italian synsets from scratch, organizing them in 
a semantic net, and putting them in correspondence with PWN synsets. For each Italian word sense, 
the lexicographer needs to look up its TEs in a bilingual dictionary, find all the synsets containing 
such TEs, look carefully at the meaning of these synsets (synonyms, glosses, semantic relations), 
and finally decide which synset is synonymous with the Italian word sense, if any. For certain word 
senses the lexicographer may need to consider tens of PWN synsets. 


To help the lexicographer in her work we devised a procedure that selects, for each sense of an Italian 
word, the PWN synsets which are most likely to have a comparable meaning, if any. In the best case 
the procedure selects only the right candidate, and the lexicographer only needs to confirm the 
selection. In the worst case the procedure finds only wrong candidates or it cannot find any candidate, 
and the lexicographer has to do all the work manually. In the most common case the procedure finds a 
restricted set of candidates including the right one, and the lexicographer needs to confirm the right 
choice and to reject the wrong ones. In other words the algorithm helps the lexicographer to focus on 
the most promising PWN synsets. 

The Assign-procedure takes as input one of the senses of the Italian-to-English section of the Collins 
dictionary and gives as output a set of candidates, each described by the pair <PWN synset, 
confidence score>, where confidence score (CS) measures the degree of confidence in the link 
between the Italian word sense and the PWN synset. Only candidates with a CS greater than a certain 
threshold are proposed to the lexicographer. Choosing such a threshold is a matter of balancing 
precision and recall. The greater the threshold, the lower is the probability that wrong candidates are 
proposed to the lexicographer (high precision), but also the greater the possibility that the right choice 
is not included in the set of candidates (low recall). See below for the actual evaluation of the 
algorithm. 
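The effect of the threshold can be sketched in a few lines. The function name, the scores and the threshold values below are assumptions chosen for illustration; the paper does not state the threshold actually used.

```python
def propose(candidates, threshold):
    """Return the (synset, CS) pairs with CS above the threshold, best first."""
    kept = [(synset, cs) for synset, cs in candidates if cs > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented candidates for one Italian word sense.
scored = [("S2", 0.4), ("S1", 0.9), ("S3", 0.1)]

permissive = propose(scored, 0.05)  # more candidates shown, more wrong ones
strict = propose(scored, 0.5)       # fewer candidates, right one may be lost
```

With the lower threshold all three candidates reach the lexicographer; with the higher one only S1 does.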

For a certain word sense listed in the Italian-to-English dictionary, the Assign-procedure considers the 
group of English words which are proposed as TEs for that word sense, and finds all the synsets 
containing at least one such TE. Such synsets constitute the set of candidates (CandSet) to be linked 
with the input Italian word sense. We can rephrase the first step of the algorithm by saying that it 
computes the CandSet of a certain Italian word meaning. The rest of the algorithm consists of 
ordering the CandSet by calculating the CS of each of its synsets. 
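The first step, computing the CandSet, amounts to a union over the synsets of each TE. The sketch below assumes a word-to-synsets index; the words and synset identifiers are invented.

```python
def candidate_set(translation_equivalents, synsets_of):
    """Collect every synset that contains at least one of the TEs."""
    candidates = set()
    for te in translation_equivalents:
        candidates.update(synsets_of.get(te, ()))
    return candidates


# Invented index from English words to the ids of the synsets they belong to.
synsets_of = {"bank": {"S1", "S2"}, "shore": {"S2", "S3"}}

cand_set = candidate_set(["bank", "shore"], synsets_of)
```

A TE that is missing from the index simply contributes no candidates.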

The ordering of the CandSet is based on a number of linking rules. Each rule, when successfully 
applied to a candidate, raises its CS. Note that the partial CS contributed by each rule varies according 
to factors specific to the rule. Besides the Collins dictionary, various resources may be accessed by 
the linking rules, such as an Italian monolingual dictionary, Italian nomenclatures, and PWN itself. 
Also, the Italian gloss that introduces almost all TGRs in the Collins dictionary plays a crucial role in 
determining the value of the CS. 

We can group the linking rules into four main groups depending on the principle on which they are 
based: generic probability, back translation, gloss matching, and synset intersection. 

A. Generic probability. The generic probability rule is based on the assumption that only one 
candidate in CandSet is the right target for the linking of an Italian word sense. As a consequence we 
can assume that the bigger the cardinality of the CandSet, the lower the probability that each 
candidate is the right one. The cardinality of the CandSet depends on the degree of ambiguity of the 
words which are proposed as TEs of the input word sense. If there is only one synset in the CandSet, 
this means that all the TEs of the input word sense are monosemous. Thus it is highly probable that 
the only synset in the CandSet is synonymous with the input word sense. Compare the monosemic 
criteria used by Atserias et al. (1997). 
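A simple way to turn this rule into a number is to give each candidate a partial CS of 1/|CandSet|. This formula is an assumption made for the sketch; the text only states that the score decreases as the CandSet grows.

```python
def generic_probability(cand_set):
    """Assign each candidate a partial CS inversely proportional to |CandSet|."""
    size = len(cand_set)
    if size == 0:
        return {}
    return {synset: 1.0 / size for synset in cand_set}


monosemous_case = generic_probability({"S1"})
ambiguous_case = generic_probability({"S1", "S2", "S3", "S4"})
```

A single candidate receives the maximum score, while each of four candidates receives a quarter of it.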

B. Back translation. The back translation rule exploits the following principle. Suppose we link a 
word sense to the correct target synset through a TE (linking-TE). Then it is probable that at least 
some of the PWN synonyms of the linking-TE have the input word as English-to-Italian TE. Take for 
example the Italian word “puntura”. When referred to insects, the Collins gives sting as TE. However 
sting belongs to 4 synsets of PWN: {sting, stinging}, {pang, sting}, {sting, bite, insect bite}, {bunco, 
bunco game, ...}. Only the third synset is synonymous with the Italian word. If we look at the 
synonyms of sting in the third synset we find that the English-to-Italian section gives puntura as a 
(back) translation of bite. Summing up, the back translation rule considers the PWN synonyms of 
some linking-TE, and calculates a partial CS that is proportional to the number of synonyms that have 
the Italian word as English-to-Italian TE. 
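The puntura example can be replayed with a small sketch. The function name is hypothetical, the English-to-Italian lookup contains only the single entry mentioned in the text, and the raw count of hits stands in for the partial CS, which the text describes only as proportional to that count.

```python
def back_translation_hits(italian_word, linking_te, synset_words, en_to_it):
    """Count the synonyms of the linking-TE that translate back to the Italian word."""
    return sum(
        1
        for word in synset_words
        if word != linking_te and italian_word in en_to_it.get(word, ())
    )


# English-to-Italian TEs; only the entry mentioned in the text is listed.
en_to_it = {"bite": {"puntura"}}

# The four PWN synsets of "sting" as given above.
candidates = [
    ("sting", "stinging"),
    ("pang", "sting"),
    ("sting", "bite", "insect bite"),
    ("bunco", "bunco game"),  # remaining members are elided in the text
]

hits = [
    back_translation_hits("puntura", "sting", synset, en_to_it)
    for synset in candidates
]
```

Only the third synset obtains a hit, through its member bite.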


C. Gloss matching. A set of linking rules exploits the information contained in the Italian gloss that 
introduces almost all TGRs. The gloss may contain a semantic field specification (e.g. “sclerosis n 
(Med) sclerosi”, where "Med" means medicine), a synonym (e.g. “reason 1. n a. (motive, cause) 
ragione,...”), a hypernym (e.g. “sole n (fish) sogliola”), or a specification of the context of use (e.g. 
“handle 1. n ... (of knife) manico, impugnatura; (of door, drawer) maniglia;...”). This information 
can be used in different ways. 

The information about the semantic field is exploited thanks to a resource which has been developed 
in parallel with MWN, the labelling of all PWN synsets with a semantic field label (see Magnini and 
Cavaglia’, 2000). If the Italian gloss contains a semantic field label, and if this label matches the label 
attached to a synset in CandSet, then the candidate gets a partial CS. For instance the label Elettr 
contained in “corrente n ... (Elettr, di acque) current” matches the label Electricity attached to the 
synset {current, electric current}. Variations in the form of the labels are handled via a 
correspondence table. 
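The label check can be sketched with a small lookup table. Only the Elettr/Electricity pair comes from the text; the second table entry and the function name are assumptions added for illustration.

```python
# Correspondence between Collins field labels and PWN semantic field labels.
# "Elettr" -> "Electricity" is the pair given in the text; "Med" -> "Medicine"
# is an assumed entry.
LABEL_TABLE = {"Elettr": "Electricity", "Med": "Medicine"}


def field_label_matches(gloss_labels, synset_label, table=LABEL_TABLE):
    """True if any Collins label in the gloss maps to the synset's field label."""
    return any(table.get(label) == synset_label for label in gloss_labels)


match = field_label_matches(["Elettr"], "Electricity")
no_match = field_label_matches(["Elettr"], "Medicine")
```

A candidate for which the check succeeds would then receive the partial CS of this rule.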

When the Italian glosses contain words or phrases, we try a match between them and the words 
contained in PWN glosses. To do this, we extract the lemmas of the Italian and English words of the 
gloss, and we check whether one of the English words can be the TE of one of the Italian words. The 
strength of the matching depends on the ambiguity of the translation. The more polysemous the words 
the lower is the strength. 

There are two extensions to this mechanism, based on the fact that glosses often specify the genus of 
the word they are defining, instead of a synonym. The first extension tries a match between an Italian 
word and the hypernym of its TE. The second mechanism tries a match between an Italian word and 
an English word contained in the gloss of a hypernym of the candidate synset. If the match between 
an Italian and English word is achieved through one of the indirect mechanisms, the partial CS will be 
lower than in the direct case. 
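The sketch below shows one way such a match strength could be computed. The weighting is an assumption: the strength is taken as one over the number of translations of the Italian lemma, and indirect matches are halved. The text gives no formula, and the dictionary entry is reduced to two translations for the example.

```python
def gloss_match_strength(italian_lemmas, english_lemmas, it_to_en, indirect=False):
    """Score the best translation-based match between two glosses."""
    english = set(english_lemmas)
    best = 0.0
    for lemma in italian_lemmas:
        translations = it_to_en.get(lemma, set())
        if translations & english:
            # More possible translations means a more ambiguous, weaker match.
            best = max(best, 1.0 / len(translations))
    return best * (0.5 if indirect else 1.0)


# Illustrative Italian-to-English TEs.
it_to_en = {"pelle": {"skin", "leather"}}
pwn_gloss_lemmas = ["fold", "skin", "muscle"]

direct = gloss_match_strength(["pelle"], pwn_gloss_lemmas, it_to_en)
via_hypernym = gloss_match_strength(["pelle"], pwn_gloss_lemmas, it_to_en, indirect=True)
```

The indirect variant would be called with the lemmas of a hypernym, or of a hypernym's gloss, instead of those of the candidate's own gloss.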

A variant of the previous rule resorts to a more fine-grained analysis of the glosses. The Collins 
dictionary specifies the context of use of a word by following a restricted number of patterns. For 
instance to specify the usage of a noun, the Collins dictionary may use the pattern of+noun: see for 
instance “piega n ... (della pelle) fold ...”, where della pelle means of the skin. A similar strategy is 
sometimes used by the PWN glosses, for instance: “{fold, plica} -- (a folded part (as a fold of skin or 
muscle))”. Thanks to the parallelism between della pelle in the Italian gloss and of skin in the PWN 
gloss we can reinforce the linking between the Italian word sense and the PWN candidate. 

This is done in fact by a specific linking rule, which executes a shallow parsing of the Collins and 
PWN glosses, isolates of+noun patterns and selects their nominal heads. To match the two heads the 
same translation-based technique explained in the previous paragraphs is applied. 
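As a rough stand-in for the shallow parsing step, the of+noun pattern can be approximated with regular expressions. This is a crude assumption for illustration only: the word immediately following the preposition is taken as the nominal head, and only a few Italian forms of the preposition are covered.

```python
import re

# Italian "di" and some of its articulated forms, followed by one word.
IT_OF_NOUN = re.compile(r"\b(?:di|del|dello|della|dei|degli|delle)\s+(\w+)", re.IGNORECASE)
# English "of", an optional article, then one word.
EN_OF_NOUN = re.compile(r"\bof\s+(?:(?:a|an|the)\s+)?(\w+)", re.IGNORECASE)


def of_noun_heads(gloss, pattern):
    """Return the lower-cased words found in of+noun position."""
    return [head.lower() for head in pattern.findall(gloss)]


italian_heads = of_noun_heads("della pelle", IT_OF_NOUN)
english_heads = of_noun_heads("a folded part (as a fold of skin or muscle)", EN_OF_NOUN)
```

The two heads found here, pelle and skin, would then be compared through the bilingual dictionary as described for the other gloss matching rules.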

D. Synset intersection. This rule exploits the fact that TGRs can include multiple TEs, which of course 
are synonymous. If one of the TEs is ambiguous we can use the other TEs to disambiguate. In practice 
the rule takes the different sets of candidates which are accessible through the different TEs, and 
intersects them. The synsets which are in the intersection get a partial CS. For instance, the Italian 
word pilastro is translated in its metaphorical sense as “pillar, mainstay”. The word pillar belongs to 5 
PWN synsets, whereas mainstay belongs to 3 synsets. However there is only one synset that includes 
both of them: “{pillar, mainstay} - a prominent supporter; ‘he is a pillar of the community’”. This 
synset gets a partial CS from the rule. 
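The pilastro example can be sketched as a set intersection. The synset identifiers are invented placeholders; only the counts (5 synsets for pillar, 3 for mainstay, 1 shared) follow the text.

```python
def synset_intersection(translation_equivalents, synsets_of):
    """Return the synsets shared by all TEs of a translation group."""
    candidate_sets = [set(synsets_of.get(te, ())) for te in translation_equivalents]
    if len(candidate_sets) < 2:
        return set()  # nothing to intersect with a single TE
    return set.intersection(*candidate_sets)


# Placeholder ids: 5 synsets for "pillar", 3 for "mainstay", 1 in common.
synsets_of = {
    "pillar": {"pillar#1", "pillar#2", "pillar#3", "pillar#4", "pillar+mainstay"},
    "mainstay": {"mainstay#1", "mainstay#2", "pillar+mainstay"},
}

shared = synset_intersection(["pillar", "mainstay"], synsets_of)
```

Only the shared synset is returned, and it is this synset that would receive the partial CS.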

To assess the performance of the Assign-procedure, we carried out an evaluation based on the nouns 
listed under the letter D in the Italian-to-English section of the Collins dictionary (the letter has been 
chosen randomly). We took the number of TGRs in this part of the dictionary as an estimation of the 
number of word senses for which the algorithm should be able to find some candidate synsets. We 
selected the candidates with a confidence score higher than a fixed threshold, that is the candidates 
that were actually proposed to the lexicographers. The number of such candidates is 89% of the 
number of word senses listed in the Collins dictionary. Then, after the candidates were confirmed or 
rejected by the lexicographers, we calculated precision and recall of the candidates selected by the 
algorithm. The precision amounts to 70%, calculated as ratio between the number of candidates 
accepted by the lexicographers and the number of candidates proposed by the algorithm. The recall 
amounts to 63%, calculated as ratio between the number of candidates accepted by the lexicographers 
and the number of word senses listed in the Collins dictionary. 
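The two ratios can be written out explicitly. The paper does not report raw counts, so the counts below are invented to roughly reproduce the reported percentages. Under these definitions recall equals precision multiplied by the ratio of proposed candidates to word senses, i.e. about 0.70 × 0.89 ≈ 0.62, which agrees with the reported 63% up to rounding of the published figures.

```python
def precision_recall(n_word_senses, n_proposed, n_accepted):
    """Precision and recall as defined in the evaluation above."""
    precision = n_accepted / n_proposed
    recall = n_accepted / n_word_senses
    return precision, recall


# Invented counts: 1000 word senses, 890 proposed candidates (89%),
# of which 623 were accepted by the lexicographers.
precision, recall = precision_recall(1000, 890, 623)
coverage = 890 / 1000
```

With these counts precision is 0.70 and recall is 0.623, the product of precision and coverage.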

2.2. The Lexical Gaps-Procedure 

The literature on contrastive analysis shows that, given a source and a target language, various types 
of idiosyncrasies can occur at the lexical level. However, only some of them are relevant to the 
information coded in MWN, which strictly follows the PWN building criteria. In MWN, a synset of a 
language L1 containing lexical units w1, ..., wn has a correspondent in another language L2 if there are 
one or more lexical units in L2 which are cross-language synonyms of w1, ..., wn. As a consequence, 
only two kinds of idiosyncrasies imply a lack of cross-language correspondence in MWN (Bentivogli 
and Pianta, 2000): 

• lexical gaps: a language expresses through a lexical unit what the other language expresses with a 
free combination of words (borrower = chi prende in prestito) (Hutchins and Somers, 1992). 
Following the MWN building criteria only idioms and restricted collocations are considered 
lexical units and thus can be synonymous with simple or compound words. On the contrary, a free 
combination of words is not a lexical unit and thus implies a missing synset for that language. 

• denotation differences: the TE of a source language word exists but it is more general or more specific. 
In the former case the TE is a sort of cross-linguistic hypernym of the source language word and 
in the latter case it is a cross-linguistic hyponym (bell = (small/electric bell) campanello + (church 
bell) campana + (on cats) sonaglio). 

During the construction of Italian WordNet we developed a procedure for identifying lexical gaps in a 
semi-automatic way. The procedure is based on the distinction between idioms and restricted 
collocations on the one hand and free combinations of words (which imply gaps) on the other hand. In 
practice, the boundaries between idioms, restricted collocations and free combinations of words are 
not clear-cut. However, in many cases a distinction can be drawn by relying on knowledge contained 
in dictionaries that explicitly mark idioms and restricted collocations. Also, the three groups exhibit 
certain structural regularities that can be exploited to automatically distinguish them from each other 
with a certain degree of confidence. For further details see Bentivogli et al. (2000). 

The LG-procedure automatically classifies all TGRs of the Collins bilingual dictionary in three main 
classes: lexical units, lexical gaps and TGRs that need to be manually checked and classified as 
lexical units or lexical gaps. The results are the following: 

                     Lexical units (%)   Lexical gaps (%)   Manual check (%)
English-to-Italian   88.4                1.0                10.6
Italian-to-English   92.1                0.9                7.0

Information about lexical gaps can be used in two ways, depending on whether we are dealing with 
Italian-to-English gaps or vice versa. The Italian-to-English gaps point to a set of Italian synsets that 
need to be added manually in the Italian WN: we know for sure and from the beginning that such 
synsets cannot be built in correspondence to any English synset and thus their construction cannot be 
based on the results of the Assign-procedure. On the other hand, information about English-to-Italian 
gaps points to PWN synsets which do not have a correspondent in Italian and can be excluded a priori 
from those selected by the Assign-procedure. 

It is worthwhile to note that the procedure is also significant from a theoretical point of view. In fact it 
provides as a further result an approximate quantitative evaluation of lexical gaps, showing that the 
English and Italian lexica are highly comparable and thus giving strong empirical support to the 
MWN model. 

3. The Data Model of MultiWordNet 

The data model underlying the MWN database reflects the main theoretical assumptions of the MWN 
model. The database is based on the idea that there is a set of data which are common to all languages, 
and other data which are specific for each language. We take for granted that semantic relations 
(has-hypernym, has-part, entails, etc.) are common, whereas lexical relations (has-synonym, derives from, 
etc.) are language-specific. In the current implementation, which contains only English and Italian 
data, the semantic relations of PWN are contained in a module called COMMON-DB, whereas the 
lexical relations for Italian and English are contained in two other modules called ITALIAN-DB and 
ENGLISH-DB. In other words the information about which words belong to which synsets is 
contained in language-DBs, whereas the information about the relations between synsets which hold 
for all languages is contained in COMMON-DB. Another crucial piece of information, that is the 
cross-language correspondence between synsets, is realized by using the same synset identifier in the 
different languages. All the synsets of different languages which have the same identifier belong in 
fact to the same multisynset. COMMON-DB describes the relations between the multisynsets of 
MWN. Note that all semantic information which is language-independent can be added to the 
COMMON-DB. This is in fact what has been done with the semantic field information (see above, 
Sect. 2.1). 

In the above description, we showed how the MWN data model represents the fact that different 
languages share a good deal of information at the conceptual level. However, the data model also needs 
to represent conceptual divergences between languages (e.g. lexical gaps). Moreover, even if we 
take the PWN semantic relations as the basis for the COMMON-DB, we want to be able to add new 
semantic relations or to modify existing ones. The possibility of modifying the PWN semantic 
relations and of representing conceptual idiosyncrasies in specific languages has been implemented by 
resorting to add-on modules which overwrite (without physically changing) the original PWN data. 
The COMMON-DB in fact contains all the original PWN semantic relations plus a COMMON-ADD-ON 
overwriting part of them. Also, each language-DB contains a language-ADD-ON specifying the 
semantic relations which are idiosyncratic to that language. Fig. 1 summarises the main features of the 
MWN data model. The arrows represent the overwriting relations. Within the COMMON-DB, PWN 
data are overwritten by the COMMON semantic ADD-ON, whereas the COMMON-DB is 
overwritten by the semantic ADD-ON of each language-DB. 
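The overwriting mechanism can be illustrated with layered lookups, where a later layer shadows an earlier one without modifying it. The sketch below uses Python's ChainMap purely as an analogy; the synset identifiers and relations are invented, and the actual MWN database (written in Lisp and MySQL) is not organised this way as far as the text tells us.

```python
from collections import ChainMap

# Relations keyed by (synset id, relation name); the values are target synset ids.
pwn = {
    ("S1", "has-hypernym"): {"S0"},
    ("S2", "has-hypernym"): {"S0"},
}
# COMMON add-on: corrects one PWN relation for all languages.
common_add_on = {("S2", "has-hypernym"): {"S1"}}
# Language add-on: a relation that is idiosyncratic to Italian.
italian_add_on = {("S3", "nearest"): {"S1"}}

# Lookups try the leftmost layer first, so add-ons shadow the PWN data.
common_db = ChainMap(common_add_on, pwn)
italian_view = ChainMap(italian_add_on, common_add_on, pwn)

overridden = italian_view[("S2", "has-hypernym")]
inherited = italian_view[("S1", "has-hypernym")]
```

The original PWN dictionary is never changed, which mirrors the requirement that the add-ons overwrite the PWN data without physically changing them.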



Figure 1: The MultiWordNet Data Model 

Lexical idiosyncrasies are the typical kind of information which is coded in the language-specific 
add-ons. If there is positive lexicographic evidence that a certain lexical concept is missing in one 
language, a special empty-node label substitutes the synonyms in the lexical section of the 
language-DB. Then, two different strategies are followed to represent denotation differences vs. lexical gaps 
(see Sect. 2.2). If the empty node corresponds to a denotation difference, one or more nearest 
relations link the empty node to one more general or to many more specific synsets. If the empty node 
corresponds to a lexical gap, an appropriate translating paraphrase (introduced by the TE keyword) is 
reported in the gloss of the empty node. The nearest relations are included in the language-specific 
add-ons. 
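A possible record layout for empty nodes is sketched below. The class, the label string and the synset identifiers are all hypothetical; only the two examples (borrower as a lexical gap, bell as a denotation difference) are taken from Sect. 2.2.

```python
from dataclasses import dataclass, field

EMPTY_NODE = "<empty>"  # hypothetical spelling of the empty-node label


@dataclass
class LanguageSynset:
    synset_id: str
    synonyms: list
    gloss: str = ""
    nearest: list = field(default_factory=list)  # ids of nearest synsets


def idiosyncrasy_kind(synset):
    """Classify an empty node as a denotation difference or a lexical gap."""
    if synset.synonyms != [EMPTY_NODE]:
        return None
    return "denotation difference" if synset.nearest else "lexical gap"


# Italian side of the synset for "borrower": no lexical unit, only a paraphrase.
borrower_it = LanguageSynset("id-borrower", [EMPTY_NODE], gloss="TE: chi prende in prestito")
# Italian side of the synset for "bell": linked to more specific Italian synsets.
bell_it = LanguageSynset(
    "id-bell", [EMPTY_NODE], nearest=["id-campanello", "id-campana", "id-sonaglio"]
)
```

In this layout a lexical gap is recognised by the absence of nearest relations, with the paraphrase stored in the gloss after the TE keyword.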


Each language-DB also contains a module with lexicographic information about the (possibly wrong) 
link between word senses and synsets. See the following section for more details on this. 

The following table reports the data concerning the first version of Italian WordNet, which contains 
the most frequent words in the Italian lexicon. 

              Nouns    Verbs   Adjectives   Adverbs   Total
Word senses   37,235   8,296   4,511        1,805     51,847
Words         26,463   4,414   4,220        1,417     36,514
Synsets       20,571   4,130   2,413        1,006     28,120

As regards relations, all common semantic relations are imported from PWN and are available as well 
as the nearest relations, i.e. the new semantic language-specific relations that have been added in 
MWN to represent denotation differences. On the contrary, synonymy is the only lexical relation 
instantiated so far. 

At the application level, procedures have been implemented for data consistency checking, 
encrypting, and statistics reporting. The MWN data model has been implemented both through a 
specialized database written in Lisp and as MySQL tables. The Lisp implementation is used to develop 
the resource and supports multi-user concurrent access and updating, whereas the MySQL solution 
increases browsing efficiency. Both implementations run under both the UNIX and Windows 
environments. 


4. The Graphical Interface 

A graphical user interface has been designed to help the user in browsing and possibly editing MWN 
multisynsets. The graphical interface, implemented in Tcl/Tk, acts as a client of the central server and 
manages all MWN information. Two access modes are defined: the browse mode allows an end-user 
to access all information in the database, excluding the lexicographer cards, whereas the edit mode 
gives access to all available data, and allows the lexicographer to modify and add new data. 

The interface (see Fig. 2) consists of three main areas: the search area (1), the synonym area (2), and 
the relation area (3). 

Figure 2: The MultiWordNet interface. 


In the search area (1) the MultiWordNet tabbed pane (ticked off in the figure) has the same 
functionalities as the PWN interface, including the ability to search by substring. Moreover, unlike the 
PWN interface, the MWN interface requires the user to select the language of the word she is 
searching (identified by a little flag near the input field). In addition, when a synset is selected in the 
display field, it becomes the focus of the interface and additional information about that synset is 
displayed in the other areas of the interface. The example in Fig. 2 shows also that whenever an 
Italian gloss is not available for an Italian synset, the English gloss of the corresponding PWN synset 
is displayed instead (in square brackets). 

The synonym area (2) contains information about the multisynset to which the focus synset belongs. 
This area is in turn split into three zones. The lower one contains the synset of the search language, 
i.e. synonyms, glosses, and comments. The middle zone contains the same information for another 
language (currently only English is available). The upper zone displays the information common to all 
the synsets of the multisynset, namely the identification number, the part of speech, and the semantic 
field labels. The synonym area also allows access to lexicographic information. The Synonyms... button 
on top of the lower zone opens a Synonym window (see Fig. 3) in which all the lexicographic cards 
related to the focus synset can be accessed. Each lexicographer card includes an evaluation about the 
assignment of a word sense to a synset. The word sense may be identified by reference to either a 
bilingual dictionary or the DISC Italian monolingual dictionary. Information about register of use, 
examples of use, and comments can be added. Note that a card can also state that a certain word sense 
should not be assigned to a synset (see the nn flag). This kind of information is as relevant as the 
positive assignments for documenting lexicographic work. 



Figure 3: The Synonym Window. 


In some cases, the lexicographer may want to access all the cards related to a certain word (instead of 
a synset as in the example above), independently of the senses and the linked synsets. To this 
end a specific tabbed pane Synonym cards is available in the search area (see Fig. 4), as an 
alternative to the MultiWordNet pane described above. By specifying a word in the input field a list of 
candidate synsets is shown in the display field. Note also that negative assignments are accessible. 
From the described list the lexicographer can focus on a specific candidate synset as in the previous 
case. 


[Figure 4 screenshot: a list of synonym cards, each showing synset identifier, part of speech, assignment flag and gloss: 09396070 n yy (tree), 02970751 n yy (mast), 03308527 n nn (shaft).] 

Figure 4: The Synonym Cards Tabbed Pane. 


The third and last main area (3) of the MWN graphical interface displays information about the 
semantic relations of the focus synset. Whereas the MultiWordNet pane, following the PWN 
approach, shows information about one semantic relation at a time, the relation area shows 
information about all the semantic relations of a synset. 


Note that the interface gives access to the full MWN hierarchy. The users can also see relations from 
the focus synset to synsets which do not have any synonyms in the search language. In this case, the 
keyword GAP! is displayed in place of the synset. The GAP! keyword is used both to describe lexical 
gaps and denotation differences (see Sect. 2.2). When a GAP! node corresponds to a denotation 
difference the interface will display the nearest relations for that node. If the GAP ! node corresponds 
to a lexical gap, the appropriate translating paraphrase is displayed in the gloss. Semantic relations 
that are idiosyncratic to the search language (such as the nearest relation) are marked by a little flag. 


5. Conclusions 


In this paper we have discussed a model for an aligned multilingual database that differs from that of 
EuroWordNet. MultiWordNet stresses the usefulness of a strict alignment between wordnets of 
different languages, while retaining the ability to represent true lexical idiosyncrasies between 
languages. One of the biggest advantages of the MultiWordNet model is the possibility to use 
automatic procedures to speed up the lexicographic work needed to build the wordnet of a new 
language. The Princeton WordNet itself can be used as a crucial resource by these procedures. 

A database model relying on MultiWordNet has been described, along with a graphical interface 
allowing users to browse and update the aligned database. Italian WordNet is the first instantiation of 
the MultiWordNet model. In its first version, it contains around 28,000 synsets and 52,000 word 
meanings. 


References 


Atserias J., Climent S., Farreres X., Rigau G. and Rodriguez H. (1997) Combining multiple methods 
for the automatic construction of multilingual WordNets. Proceedings of the International Conference 
"Recent Advances on Natural Language Processing" RANLP'97, Tzigov Chark, Bulgaria. 

Bentivogli L., Pianta E. and Pianesi F. (2000) Coping with lexical gaps when building aligned 
multilingual wordnets. Proceedings of LREC 2000, Athens, Greece, pp. 993-997. 

Bentivogli L. and Pianta E. (2000) Looking for lexical gaps. Proceedings of Euralex-2000 
International Congress, Stuttgart, Germany. 

Fellbaum C., ed., (1998) WordNet: An electronic lexical database. The MIT Press, Cambridge 
(Mass.). 

Hutchins W.J. and Somers H.L. (1992) An introduction to Machine Translation. Academic Press, 
London. 


Magnini B. and Cavaglia’ G. (2000) Integrating Subject Field Codes into WordNet. Proceedings of 
LREC 2000, Athens, Greece, pp 1413-1418. 

Vossen, P. (1996) Right or wrong: combining lexical resources in the EuroWordNet project. 
Proceedings of Euralex-96 International Congress. 

Vossen, P., ed., (1998) EuroWordNet: A multilingual database with lexical semantic networks. 
Kluwer Academic, Dordrecht. 

MultiWordNet homepage: http://multiwordnet.itc.it 


Emanuele Pianta, ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo - 
Trento, Italy, pianta@itc.it 

Luisa Bentivogli, ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo - 
Trento, Italy, bentivo@itc.it 

Christian Girardi, ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 18, I-38050 Povo - 
Trento, Italy, cgirardi@itc.it 





An Unsupervised Method for General Named Entity 
Recognition and Automated Concept Discovery 


Enrique Alfonseca 
Suresh Manandhar 


Abstract 

Knowledge Acquisition is still the bottleneck in building many kinds of applications, such as 
inference engines. We describe here a procedure to automatically extend an ontology with 
domain-specific knowledge. The main advantage of our approach is that it is completely unsupervised, so it 
can be applied to different languages and domains. Our initial results have been highly successful 
and we believe that with some improvement in accuracy it can be applied to large ontologies. 


1 Introduction 

There are several general-purpose ontologies available for English and other languages, such as WordNet 
[Miller, 1995], Comlex [Macleod and Grishman, 1994, Mifflin et al., 1994], and EuroWordNet [Vossen, 
1998]. However, extending them with domain-dependent information is still a labour-intensive task that 
requires a high degree of human supervision. Knowledge acquisition is today a bottleneck for construction 
of inference and expert systems, and there is a need for an automatic acquisition methodology. This paper 
presents an approach to enriching ontologies with domain-dependent information in a fully unsupervised 
way. To that end, we have put together ideas from different fields in Natural Language Processing, such 
as named entity recognition, knowledge acquisition, and procedures used in word sense disambiguation, 
that we believe may be useful for solving the problem at hand. 

We describe here a procedure to extend ontologies with domain-dependent information. The only 
input it requires is a collection of documents for one domain. In our preliminary experiments, 
with absolutely no human supervision, the new synsets from the texts were correctly placed in the ontology 
we have used. 

1.1 Related work 

Lexical repositories are a very useful resource for Natural Language Processing, and the availability of 
a few of them such as WordNet [Miller, 1995], Comlex [Macleod and Grishman, 1994] or Cyc [Lenat 
and Guha, 1990, Lenat, 1995] has made possible many successful applications. There are now automatic 
procedures to port WordNet to other languages such as Catalan [Daudé et al., 2000] and Korean [Lee
et al., 2000], but it is still difficult to find good automatic methods to learn ontologies about specific 
domains. 

Lexical knowledge acquisition from dictionaries and other semi-structured texts has already been
attempted with good results, using certain patterns in the definitions to identify the relations
among synsets. To cite a few works, Wilks et al. [1996], Grefenstette [1994] and Rigau [1998]
extracted WordNet-like ontologies from dictionary definitions. However, to our knowledge, there is still
no procedure to enrich WordNet with domain-specific information from free texts that does not rely on
human intervention in some way or other. O’Sullivan et al. [1995] extended WordNet with a
domain-specific ontology, but all the work was done by hand by domain experts. Other systems have a higher
degree of automation, but all of them depend to some degree on a human expert to classify the learnt
synsets.

In the Asium system [Faure and Nedellec, 1998], a clustering algorithm is used to create a concept
taxonomy. Maedche and Staab [2001] also describe a general architecture for acquiring ontologies
and relationships directly from free texts, semi-structured texts (such as dictionaries) and databases.
While conceptual clustering is useful for grouping the concepts identified from a text, its application
to extending already existing ontologies such as WordNet is not so straightforward. Furthermore, each
time two concepts are clustered, a new superconcept is created as their hypernym, which might have
no counterpart in the language. Other works, such as [Maedche and Staab, 2000], focus on learning
non-taxonomical relations.

Kietz et al. [2000] describe a procedure to adapt WordNet to specific domains, by removing the
non-relevant synsets and adding the domain-specific ones. When enriching WordNet with new synsets,
the system produces suggestions that have to be supervised by a human.

1.2 Structure of this document 

The next section describes the task we ultimately want to achieve. Section 3 shows some word-sense
disambiguation techniques that we applied to our work, and section 4 introduces our algorithm. Finally, we
present our experiments in sections 5 and 6 and our conclusions in section 7.

2 General Named Entity Recognition 

Let’s suppose that we have an ontology with three components, $O = \langle C, I, h \rangle$, where

• $C$ is a set of concepts (e.g. human).

• $I$ is a set of instances of concepts (e.g. Shakespeare).

• $h$ is the hypernymy function $h : C \cup I \rightarrow C$ that establishes a taxonomy of concepts and instances.

An example of such a function $h$ is the one defined as $h(c_1) = c_2$ iff the concept $c_1$ is a specialisation
of the concept $c_2$, and it reads either $c_1$ is a hyponym of $c_2$ or $c_2$ is a hypernym of $c_1$.

Note that all members of I must be leaves in the taxonomy, but not all leaves are instances; some of 
them can represent concepts that have no instances in the hierarchy. 
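
The definition above can be made concrete with a small data structure. The following Python fragment is only a sketch under our own assumptions: the class name Ontology, its fields and the toy entries are hypothetical and are not taken from the system described in this paper. It stores h as a dictionary from each node to its immediate hypernym, and it rejects any attempt to use a non-concept as a hypernym, so instances always remain leaves.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology O = <C, I, h>; all names here are illustrative."""
    concepts: set = field(default_factory=set)    # C
    instances: set = field(default_factory=set)   # I
    hypernym: dict = field(default_factory=dict)  # h: C u I -> C

    def add(self, node, parent, is_instance=False):
        # The hypernym of any node must be an already known concept,
        # so an instance can never acquire children.
        if parent not in self.concepts:
            raise ValueError(f"unknown or non-concept hypernym: {parent}")
        (self.instances if is_instance else self.concepts).add(node)
        self.hypernym[node] = parent

    def children(self, concept):
        return sorted(n for n, p in self.hypernym.items() if p == concept)

    def is_leaf(self, node):
        return not self.children(node)


onto = Ontology(concepts={"entity"})
onto.add("human", "entity")
onto.add("fairy", "entity")                         # leaf concept, no instances
onto.add("Shakespeare", "human", is_instance=True)  # instance, always a leaf
```

In this toy hierarchy both Shakespeare and fairy are leaves, but only the former is an instance, which mirrors the remark that not all leaves are instances.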

2.1 Task definition

General Named Entity Recognition is the task of identifying, for an unknown concept or instance $u$,
the correct concept $c \in C$ such that $h(u) = c$, i.e. it consists in finding the most accurate immediate
generalisation of $u$ in the known hierarchy of concepts.

For example, if we are processing a text about Tolkien’s mythology, and we find the unknown concept
hobbit, an accurate General Named Entity Recogniser would attach it to the most accurate
hypernym, which is, in WordNet 1.7, fairy, and it would be a sibling of the existing concepts elf, dwarf,
etc.


2.2 Relation to Named Entity Recognition 

Named Entity Recognition is the task of finding and classifying objects that are of interest to us. These 
objects can be people, organisations, locations, dates, or anything that is useful to solve a particular 
problem. For instance, in the following sentences, taken from the Wall Street Journal corpus in the Penn 
Treebank [Marcus et al., 1993], we can find two references to a person, one date and one organisation. 

[person Pierre Vinken], 61 years old, will join the board as a nonexecutive director [date Nov.
29]. [person Mr. Vinken] is chairman of [organisation Elsevier N.V.], the Dutch publishing
group.

We consider Named Entity Recognition to be a more restricted task, where the hierarchy is flat and
contains only a few concepts, e.g. person, organisation and location. In General Named Entity
Recognition, by contrast, we consider a taxonomy of concepts organised in a more sophisticated
ontology, where the differences between concepts can be subtle.

2.3 Relation to Word-Sense Disambiguation 

In Word Sense Disambiguation, we have a word in a text, and the task is to decide which is the correct
meaning of that word. For example, using WordNet this task involves deciding that, out of the 10 senses
of bank, in sentence (1a) it refers to the slope beside a body of water, and in (1b) it refers to a financial
institution.

(1) a. The boy played beside the river bank.

    b. I have opened an account in a bank.

Word     Freq   w1     w2       Word       Freq   w1     w2       Word        Freq   w1     w2
1        1677   1.13   2.10     Dwarves    106    1.21   2.37     Doom        61     1.17   2.24
by       1124   0.48   0.61     flowers    97     1.01   1.75     yellow      61     1.14   2.15
2        658    0.91   1.49     Races      94     1.21   2.37     pink        61     1.17   2.24
killed   645    1.16   2.21     fairy      87     1.22   2.40     Barbarian   60     1.22   2.40
he       591    0.02   0.02     giant      84     1.11   2.04     Deep        58     1.20   2.35
146      307    1.17   2.24     Killed     84     1.17   2.25     Dungeons    58     1.22   2.40
145      230    1.21   2.37     Halfling   80     1.22   2.40     obtusa      57     1.22   2.40
Human    218    1.18   2.28     Cham.      76     1.22   2.40     Mixed       56     1.20   2.34
9        213    1.01   1.76     dwarves    75     1.04   1.84     Warrior     55     1.13   2.12
Elf      212    1.13   2.10     Pink       75     1.13   2.11     king        54     1.05   1.87
Gnome    150    1.19   2.31     Fairy      70     1.22   2.40     races       53     1.22   2.40
gnome    138    1.21   2.38     Cleric     61     1.22   2.40     kB.         53     1.22   2.40

Table 1: Some top words in the signature of the synset dwarf. The second column is the frequency count, and
the third and fourth columns are the weights of the word, using Yarowsky’s function (w1) and Agirre’s function (w2).

Again, General Named Entity Recognition can be considered a more general task than Word
Sense Disambiguation: we have to find the synset whose meaning is the most similar among all
the concepts in the whole taxonomy, rather than just among the synsets containing that word.

General Named Entity Recognition is therefore a task that covers, and is harder than, both Named Entity
Recognition and Word Sense Disambiguation.


3 Word-sense disambiguation procedures 

A topic signature of a word w is the list of the words that co-occur with it, together with their
respective frequencies or weights. It is a tool that has been applied to word-sense disambiguation with
promising results [Yarowsky, 1992] [Agirre et al., 2000]. Because WordNet does not include topic
signatures, we have used the method proposed by Agirre et al. [2000] for collecting them, in an unsupervised
way, from the Internet. We include here a brief description of the procedure. Except for some minor
changes, the work described in this section has been done before.

Agirre’s method consists of the following steps. For every WordNet synset $s_i$,

1. Generate a query containing all the words in $s_i$ and its hyponyms as positive keywords, and the
words in other synsets that contain the same words as negative keywords.

2. Submit the query to an Internet search engine, and collect the results. 

3. Download the documents, look for the synset words in them, and calculate the frequencies of the 
words that occur around them, in a context of width w. 

4. Store the list of words and frequencies, excluding the most common closed-class words
(determiners, pronouns, conjunctions, etc).
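
Steps 3 and 4 can be illustrated with a minimal sketch. The Python function below is hypothetical: the name context_frequencies, the tiny stop list and the tokenisation with a simple regular expression are assumptions made for illustration, not the tooling actually used. It counts the words that occur within a given number of tokens of any occurrence of a synset word and skips the stop words. Multi-word synset entries such as body politic are not handled.

```python
from collections import Counter
import re

# A tiny, hypothetical stop list standing in for the closed-class words.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "he", "she", "it"}


def context_frequencies(text, targets, width=5, stop_words=STOP_WORDS):
    """Count words occurring within `width` tokens of any target word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    targets = {t.lower() for t in targets}
    counts = Counter()
    for pos, token in enumerate(tokens):
        if token not in targets:
            continue
        # Tokens to the left and to the right of the target occurrence.
        window = tokens[max(0, pos - width):pos] + tokens[pos + 1:pos + 1 + width]
        counts.update(w for w in window
                      if w not in stop_words and w not in targets)
    return counts


freqs = context_frequencies(
    "The dwarf killed the giant and the dwarf fled", ["dwarf"], width=2)
```

With a width of two tokens, only killed and fled fall inside a window around dwarf; giant is three tokens away from both occurrences and is not counted.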

The following is an example of the WordNet synset for country (sense 06621523 in WordNet 1.7).
state, nation, country, land, commonwealth, res publica, body politic -- (a politically organized body
of people under a single government; "the state has elected a new president"; "African nations";
"students who had come to the nation’s capitol"; "the country’s largest manufacturer"; "an industrialized
land")

The query that was produced is the following:

"country" AND ("body politic" OR "commonwealth" OR "land" OR "nation" OR "res publica"
OR "state" OR "Reich" OR "suzerain" OR "sea power" OR "great power" OR "major power" OR
"power" OR "superpower" OR "world power" OR "city state" OR "ally") AND NOT ("a people" OR
"area" OR "rural area")
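
The shape of such a query can be sketched in a few lines. The helper below is an assumption made for illustration (the function names are hypothetical, and real search engines differ in the boolean syntax they accept): it takes one head word, the remaining synset and hyponym words as positive keywords, and the words of competing synsets as negative keywords.

```python
def quote(term):
    """Wrap a keyword or multi-word expression in double quotes."""
    return f'"{term}"'


def build_query(head, positives, negatives):
    """Assemble a boolean query in the style of the example above."""
    query = quote(head)
    if positives:
        query += " AND (" + " OR ".join(quote(t) for t in positives) + ")"
    if negatives:
        query += " AND NOT (" + " OR ".join(quote(t) for t in negatives) + ")"
    return query


query = build_query("country",
                    ["nation", "state", "superpower"],
                    ["area", "rural area"])
```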

Next, the raw frequencies are changed into weights. The reason is that some words are equally 
frequent in all document collections, so they do not provide contextual support and can be ruled out. 
Furthermore, some document collections may be bigger than others, so a normalisation is required to 
give the same overall weight to all signatures. 

For every list of word frequencies $l_i$,

• Transform the word frequencies into weights, and produce the topic signature $t_i$.

attach(u, C)
    u is the unknown synset,
    C is a collection of domain-specific documents.

    1. Calculate l_u, the list of frequencies of words co-occurring with u, using the documents in C.
    2. Let r be the root synset in the ontology.
    3. return analyseLevel(l_u, r)

analyseLevel(l_u, c)
    l_u is the unknown synset’s list of word frequencies,
    c is the candidate synset most similar to u.

    1. Get c’s synset children, {c_1, c_2, ..., c_n}.
    2. t_c ← c’s topic signature
    3. t_c1, ..., t_cn ← children’s topic signatures.
    4. Find the signature which is most similar to l_u
       4.1. If that signature is t_c, return c
       4.2. Let t_ci be the signature that scored best.
       4.3. return analyseLevel(l_u, c_i)

Figure 1: Pseudocode of the algorithm for finding the correct place where the unknown synset u will be
attached in the ontology

In our current work, we have used two weight functions, both of which have been already used for 
word-sense disambiguation. 

3.1 First weight function 

The weight function of Yarowsky [1992] is computed as follows: let’s suppose that we have several lists of
word frequencies $\{l_1, \ldots, l_n\}$, counted from document collections that contain, respectively, the words in
synsets $\{s_1, \ldots, s_n\}$. Then, the weight for each word is given by equation 1.

$$\log \frac{P(w \mid s_i) \cdot P(s_i)}{P(w)} \qquad (1)$$

where $P(w)$ is the overall probability of a word; $P(w \mid s_i)$ is the probability of $w$ given that it is in
the context of a synset; and $P(s_i)$ is the probability that a word is in the context of $s_i$. The first two
probabilities are estimated from the document collections, and the third one is assumed to be uniform.
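
As an illustration of equation 1, the following sketch estimates the probabilities from raw counts. It is not the original implementation: the function name, the dictionary-of-Counters input format and the use of relative frequencies as probability estimates are assumptions made here for the example.

```python
import math
from collections import Counter


def yarowsky_weights(freq_lists):
    """freq_lists maps each synset to a Counter of context-word frequencies.

    Returns, per synset, log(P(w|s) * P(s) / P(w)) for each word seen with
    it, with P(s) taken as uniform over the synsets.
    """
    total = sum(sum(c.values()) for c in freq_lists.values())
    overall = Counter()
    for counts in freq_lists.values():
        overall.update(counts)           # word counts over all collections
    p_s = 1.0 / len(freq_lists)          # uniform P(s)
    weights = {}
    for synset, counts in freq_lists.items():
        size = sum(counts.values())
        weights[synset] = {
            w: math.log((f / size) * p_s / (overall[w] / total))
            for w, f in counts.items()
        }
    return weights


lists = {"dwarf": Counter(axe=3, he=1), "fairy": Counter(wand=3, he=1)}
weights = yarowsky_weights(lists)
```

In this toy input the word he occurs with both synsets and therefore receives a lower weight for dwarf than axe, which occurs only with dwarf.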


3.2 Second weight function 

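The second weight function we have tested is the same that Agirre et al. [2000] used in their word-sense
disambiguation experiments. If $w_j$ is a word, and $\mathit{freq}_{i,j}$ is its frequency in the frequency list $l_i$, then its
expected mean $m_{i,j}$ is defined as

$$m_{i,j} = \frac{\sum_{i} \mathit{freq}_{i,j} \cdot \sum_{j} \mathit{freq}_{i,j}}{\sum_{i,j} \mathit{freq}_{i,j}} \qquad (2)$$

The weight of the word $w_j$ in the topic signature $t_i$ is then

$$w_{i,j} = \begin{cases} \dfrac{\mathit{freq}_{i,j} - m_{i,j}}{m_{i,j}} & \text{if } \mathit{freq}_{i,j} > m_{i,j} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$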
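
The second weight function can be sketched in the same way. The code below assumes that the weight is (freq − m)/m when the observed frequency exceeds the expected mean m and zero otherwise, which is our reading of the weighting used by Agirre et al. [2000]; the function name and the input format are hypothetical.

```python
from collections import Counter


def expected_mean_weights(freq_lists):
    """freq_lists maps synset i -> Counter holding freq[i][j] for words j."""
    row_sum = {s: sum(c.values()) for s, c in freq_lists.items()}  # sum over j
    col_sum = Counter()                                            # sum over i
    for counts in freq_lists.values():
        col_sum.update(counts)
    total = sum(row_sum.values())
    weights = {}
    for s, counts in freq_lists.items():
        weights[s] = {}
        for w, f in counts.items():
            m = col_sum[w] * row_sum[s] / total   # expected mean m_ij
            # Assumed form of the weight: relative excess over the mean.
            weights[s][w] = (f - m) / m if f > m else 0.0
    return weights


lists = {"dwarf": Counter(axe=3, he=1), "fairy": Counter(wand=3, he=1)}
weights = expected_mean_weights(lists)
```

Here axe, seen only with dwarf, obtains a positive weight, whereas he, spread evenly over both synsets, does not exceed its expected mean and is weighted zero.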





3.3 Discussion 


Both functions have the desirable property that the weight associated with each word measures
how strongly that word indicates that we are in the context of a certain WordNet synset, regardless of
the actual frequency values. So, if two words have appeared in the context of a synset with different
frequencies, but neither of them appears in the context of any other synset, they will both score the
maximum value of the weight, because both maximally support that synset.

In our experiments, although the word weights and the similarity values differ slightly between the
two functions, they always produced the same synset classifications.

Table 1 shows the signature corresponding to the synset dwarf, and the weight values with both
functions.


4 Augmenting an ontology 

We use the notion that semantically related words must co-occur with the same kinds of words [Maedche
and Staab, 2000]. In the same way that word co-occurrence information is useful for word-sense
disambiguation [Yarowsky, 1992] [Agirre et al., 2000], it can also be used to calculate a degree of similarity
between concepts and, therefore, to decide which concept in an ontology is the most similar to a new
unknown concept u. All the work described from this section on is original.

The procedure is detailed in Figure 1. It is a top-down search, starting at the most general concept
in the taxonomy, which tries to find the concept whose topic signature is closest to the frequency list
of the target concept.

Let’s suppose that we have a domain-specific collection of documents and we find references to a
concept u that is not present in our ontology. First, we compile the list $l_u$ of words that co-occur with
that concept in the sample documents, and compute their frequencies. Next, at each level of the ontology,
we find the concept whose topic signature is the most similar to $l_u$. We may stop in the middle of the
hierarchy if a concept’s signature scores higher than the signatures of all its child concepts.

4.1 Similarity metric 

Let $t_i$ be the topic signature of a concept, and $l_u$ be the list of frequencies of co-occurring words for the
unknown concept.

$$t_i = \{\langle w_1, w_{i,1} \rangle, \ldots, \langle w_n, w_{i,n} \rangle\}$$

$$l_u = \{\langle w_1, f_1 \rangle, \ldots, \langle w_n, f_n \rangle\}$$

where $w_j$ is the $j$-th word in the list, $w_{i,j}$ is its weight in the topic signature $t_i$, and $f_j$ is the frequency
count in the contexts of $u$ in the collection of domain-specific documents.

Then, the similarity function we have used is the dot product of both vectors [Yarowsky, 1992]:

$$\mathrm{Similarity}(t_i, l_u) = \sum_{j=1}^{n} w_{i,j} \cdot f_j \qquad (4)$$

Therefore, to find the concept that is most similar to the unknown concept $u$ (step 4 in the algorithm)
we have to find

$$\arg\max_i \, \mathrm{Similarity}(t_i, l_u) \qquad (5)$$
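
The dot-product similarity and the top-down search of Figure 1 can be put together in a short sketch. The function names, the dictionary-based representation of the hierarchy and the toy signatures are assumptions made for illustration; in particular, resolving ties in favour of stopping at the current concept is a design choice of this sketch, not something stated in the paper.

```python
def similarity(signature, freqs):
    """Dot product of a topic signature (word -> weight) and a frequency list."""
    return sum(weight * freqs.get(word, 0) for word, weight in signature.items())


def analyse_level(freqs, concept, children, signatures):
    """Descend from `concept` until its own signature beats all its children's."""
    kids = children.get(concept, [])
    if not kids:
        return concept
    # The current concept comes first, so ties resolve in favour of stopping.
    candidates = [concept] + kids
    best = max(candidates,
               key=lambda c: similarity(signatures.get(c, {}), freqs))
    if best == concept:
        return concept
    return analyse_level(freqs, best, children, signatures)


def attach(freqs, root, children, signatures):
    """Return the concept under which the unknown synset should be attached."""
    return analyse_level(freqs, root, children, signatures)


# Hypothetical toy hierarchy and signatures.
children = {"entity": ["person", "location"], "person": ["man", "dwarf"]}
signatures = {
    "entity": {},
    "person": {"said": 1.0, "walked": 0.5},
    "location": {"north": 1.0, "river": 1.0},
    "man": {"said": 1.0, "pipe": 2.0},
    "dwarf": {"axe": 2.0, "said": 0.5},
}
unknown = {"said": 4, "pipe": 3, "river": 1}
hypernym = attach(unknown, "entity", children, signatures)
```

With these toy values the search moves from entity to person and then to man, where it stops because man has no children.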

4.2 Adapting WordNet to the problem 

Before this procedure can be applied to WordNet, two changes are desirable. The first and more
important one is its enrichment with topic signatures. The procedure to do this is completely automatic
and needs no human supervision, but we have not been able to finish it because downloading all the
documents is rather slow, and there was not enough time to change WordNet.

Secondly, it would be desirable for each synset to have a flag indicating whether it represents an instance
or a concept. Because instances (e.g. Shakespeare) cannot have hyponyms, marking them would make the search
space for our algorithm smaller. We describe a way to do this annotation in [Alfonseca and
Manandhar, 2002].





entity
    person
        man
        dwarf
        fairy
    location
        country
            Wales
            Spain

Figure 2: Initial ontology in the preliminary experiments


(left) Hobbit:

entity
    person (sim=285)
        man (sim=642)
        dwarf (sim=555)
        fairy (sim=340)
    location (sim=250)

(right) Mordor:

entity
    person (sim=329)
    location (sim=373)
        country (sim=?)
            Wales (sim=592)
            Spain (sim=253)

Figure 3: Decisions taken when classifying Hobbit (left) and Mordor (right).


5 Preliminary experiments 

Initially, we tested our algorithm on a small ontology, displayed here in Figure 2.

The domain-specific documents we used consisted of an electronic version of The Lord of the Rings
[Tolkien, 1968], which contains roughly 478,000 words. We chose the unknown concept hobbit and the
unknown instance Mordor, both appearing in the text.

Figure 3 shows the value of the similarity function, at each level, for classifying hobbit and Mordor.
The first one was finally attached to man, although there was also strong evidence pointing to dwarf
as a possible hypernym. Concerning Mordor, because Wales and Spain are instances, it is finally attached
to country. Note as well that the algorithm did not calculate the similarity with country, because Mordor
had been identified as an instance, using evidence from the texts, and therefore the algorithm just
proceeded downwards so as to leave it as a leaf in the hierarchy.

The algorithm performed well with the small hierarchy that we produced for our experiments. However,
the first discrimination, that of deciding whether the unknown synset was a person or a location, was
always a very narrow one. Because WordNet is a much more complex network, and there are nodes with hundreds of
children (e.g. person), this first approach would probably need some fine-tuning and adjusting before it
works properly.


6 Final experiments 

The main problem we identified in the preliminary experiments was that person and location are far too
general concepts, and their topic descriptions contained many general terms that were usually not
very representative of the sub-concepts located below them. Therefore, we made another study where,
for calculating the topic description of a concept, all the frequency lists of its sub-concepts were added
up as well. For instance, to calculate the topic description of person, the program first added to its list
of frequencies those of its sub-concepts dwarf, fairy and man. This produced much better results.
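
The aggregation just described amounts to a recursive sum over the hierarchy. The sketch below is an illustration under our own assumptions (the function name, the data layout and the toy counts are hypothetical): the frequency list of a concept is its own list plus the lists of every concept below it, and the weights would then be computed from these aggregated lists.

```python
from collections import Counter


def aggregated_frequencies(concept, children, own_freqs):
    """Frequency list of `concept` plus those of all concepts below it."""
    total = Counter(own_freqs.get(concept, {}))
    for child in children.get(concept, []):
        total.update(aggregated_frequencies(child, children, own_freqs))
    return total


children = {"person": ["man", "dwarf", "fairy"]}
own_freqs = {
    "person": {"said": 2},
    "man": {"said": 3, "pipe": 1},
    "dwarf": {"axe": 4},
    "fairy": {"wand": 2},
}
person_freqs = aggregated_frequencies("person", children, own_freqs)
```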

Figure 4 shows the ontology used in our last experiments. It is slightly more complex than the
previous one. The concepts we extracted from the domain-specific texts are hobbit, wizard, Mordor, eagle
and horse.

entity
    location
        country
            Ireland
    animate
        person
            man
            dwarf
            fairy
        animal

Figure 4: Ontology used in the final experiments.

We performed several experiments, varying the size of the context from which the word frequencies
were calculated. As expected, the bigger the context, the more words are used for the topic signature, which
is then more complete, but this also introduces noise words and makes the classification more difficult. Table 2
shows the classification of these new concepts in three different settings. In the first column, frequencies
for the topic signatures were collected from the paragraphs that contained the concept (e.g. city, man,
etc.) and frequencies for the domain-dependent concepts were collected from the sentences containing
them. In the second column, both frequency lists were taken using sentences as context. In the
third column, the frequencies for the domain-dependent concepts were collected using a context of five
words at each side. In this last setting, not only were more concepts correctly classified, but the
similarity measure also supported the correct decisions more strongly at each decision step.

Setting     Par/Sent    Sent/Sent   Sent/5wds
Mordor      location    *man        location
Hobbiton    location    *animal     location
Isengard    location    *man        location
hobbit      *animal     man         man
wizard      *animal     man         man
eagle       animal      animal      animal
horse       animal      *man        *man
Accuracy    71%         43%         86%

Table 2: Classification of the concepts hobbit, Mordor, wizard, horse and eagle with different context sizes.
Wrong classifications are marked with an asterisk. Not only did the last approach give more correct
classifications, but the similarity function also gave more support to its decisions.

The resulting taxonomy was that of Figure 5. The only concept that was misclassified was horse.
The list of words and frequencies collected for horse contains words that also appear in
the context of people, such as colors, verbs of movement, and some adjectives.

Finally, Figure 6 shows the similarity values when locating the proper place for the concept wizard.

entity
    location
        country
            Ireland
        Mordor
        Hobbiton
        Isengard
    animate
        person
            man
                hobbit
                wizard
                * horse
            dwarf
            fairy
        animal
            eagle

Figure 5: Resulting ontology after the best experiments.

entity
    location (sim=8.44)
    animate (sim=37.32)
        person (sim=27.34)
            man (sim=83.84)
            dwarf (sim=31.48)
            fairy (sim=81.95)
            stop (sim=63.27)
        animal (sim=21.46)
        stop (sim=0)

Figure 6: Values for classifying the new concept wizard. The similarity labelled as stop corresponds to
the decision of stopping at that level in the tree, and attaching the new concept to the parent node (e.g.
animate or person)

7 Discussion and Conclusions

We have presented here an algorithm that is, to our knowledge, the only fully unsupervised method to
extend an ontology with unknown concepts taken from domain-specific documents. It can be applied
to different languages and domains as it is. We believe that it will be able to tackle big ontologies
once we have collected enough data in the topic signatures, and have experimented with more similarity
functions and statistical models. It is highly versatile, and it allows the attachment of new concepts to
any intermediate level in an ontology, not only at the leaves.

We have experimented with several contexts for extracting word frequencies, and the conclusion is that it
is better, in this task, to consider a small context of highly related words than a big context that includes
more words. This is in contrast to Agirre et al. [2000], who used a context window of 100 words.

In theory, this algorithm can also be used to create a new ontology from scratch. In this case, however, 
we must be careful that the concepts are learnt from the most general to the most specific one, because 
once a concept is attached to the hierarchy it is not possible to move it from its position. 

This work can also be used to test the degree of adequacy between existing ontologies, such as
WordNet, and the usage of concepts in language. For instance, fairy and dwarf are considered, in
WordNet, hyponyms of the concept psychological feature, and hence they are located far away from
animate being; but they are always used in language in the same way as animate beings, in the sense
that they are usually selected by the same verbs and have similar complements.

7.1 Improvements and future lines 

The following are only a few of the possible improvements that we may try with this procedure: 

• Try different similarity and weight functions. Combine them with other semantic distance metrics. 

• Try other features to measure similarity, or look for natural-language expressions that denote 
hyponymy. 

• Produce better topic signatures, using a bigger set of documents. 

• Use a beam search, looking for several candidate hypernyms at the same time, and finally decide 
between them. 

• Identify, from the evidence in the domain-specific texts, whether we are looking for an instance or 
a concept. If it is an individual, the search can be simplified because we know it can only be a leaf. 

8 Acknowledgements 

This work has been partially sponsored by CICYT, project number TIC2001-0685-C02-01. 

9 Authors’ affiliation

Enrique Alfonseca is an assistant lecturer at the Computer Science Department, Universidad Autónoma
de Madrid, and a part-time research student at the University of York. Suresh Manandhar is a lecturer
at the Computer Science Department, University of York.

Contact: {enrique, suresh}@cs.york.ac.uk





References 


E. Agirre, O. Ansa, E. Hovy, and D. Martinez. Enriching very large ontologies using the WWW. In
Proceedings of the Ontology Learning Workshop, ECAI, Berlin, Germany, 2000.

E. Alfonseca and S. Manandhar. Distinguishing instances and concepts in WordNet. In Proceedings of the
First International Conference on Global WordNet, Mysore, India, 2002.

J. Daudé, L. Padró, and G. Rigau. Mapping wordnets using structural information. In 38th Annual
Meeting of the Association for Computational Linguistics (ACL’2000), Hong Kong, 2000.

D. Faure and C. Nedellec. A corpus-based conceptual clustering method for verb frames and ontology
acquisition. In LREC workshop on Adapting lexical and corpus resources to sublanguages and
applications, Granada, Spain, 1998.

G. Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, 1994. 

J. Kietz, A. Maedche, and R. Volz. A method for semi-automatic ontology acquisition from a corporate 
intranet. In Workshop “Ontologies and text”, co-located with EKAW’2000 , Juan-les-Pins, French 
Riviera, 2000. 

Changki Lee, Geunbae Lee, and Seo Jung Yun. Automatic wordnet mapping using word sense
disambiguation. In 38th Annual Meeting of the Association for Computational Linguistics (ACL’2000), Hong
Kong, 2000.

D. Lenat. Steps to Sharing Knowledge. Mars N., editor, Towards Very Large Knowledge Bases. IOS 
Press, 1995. 

D. B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems. Addison-Wesley, Reading (MA), 
USA, 1990. 

C. Macleod and R. Grishman. Comlex syntax reference manual, 1994. 

A. Maedche and S. Staab. Discovering conceptual relations from text, 2000. 

A. Maedche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2),
2001.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: the 
penn treebank. Computational Linguistics , 19(2):313-330,1993. 

H. Mifflin, M.A. Boston, R. Grishman, Catherine Macleod, and Adam Meyers. Comlex syntax: Building
a computational lexicon. In Proceedings of COLING-94, Kyoto, Japan, 1994.

George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39—41, 
1995. 

D. O’Sullivan, A. McElligott, and R. F. E. Sutcliffe. Augmenting the princeton wordnet with a domain 
specific ontology. In Proceedings of the Workshop on Basic Issues in Knowledge Sharing at the 14th 
International Joint Conference on Artificial Intelligence, Montreal, 1995. 

G. Rigau. Automatic Acquisition of Lexical Knowledge from MRDs. PhD Thesis, Departament de
Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, 1998.

J. R. R. Tolkien. The Lord of the Rings. Allen and Unwin, 1968. 

P. Vossen. EuroWordNet - A Multilingual Database with Lexical Semantic Networks. Kluwer Academic 
Publishers, 1998. 

Y. A. Wilks, B. M. Slator, and L. M. Guthrie. Electric words: Dictionaries, computers and meanings. 
Cambridge, MA: MIT Press, 1996. 

David Yarowsky. Word-sense disambiguation using statistical models of roget’s categories trained on 
large corpora. In Proceedings of COLING-92, pages 454-460, Nantes, France, 1992. 





Automated Discovery of Telic Relations for WordNet 


Marco De Boni 
Suresh Manandhar 


Abstract 

A method is presented for automatically extending WordNet with the telic relationships proposed in 
Pustejovsky’s lexicon model. The method extracts telic relationships from WordNet glosses by first 
selecting a telic word through a pattern matcher aided by a part-of-speech tagger and then employing 
a word disambiguation module to select the specific meaning (synset) of the telic word. The method 
is shown to be fruitful, inferring a number of useful relationships.

1. Introduction 

WordNet (Miller 1995; Fellbaum 1998) is a lexical database which organizes words into synsets, sets 
of synonymous words, and specifies a number of relationships such as hypernym, synonym,
meronym which can exist between the synsets in the lexicon. The explicit relationships in WordNet 
do not exhaust (nor claim to exhaust) the set of possible relationships between words and there is 
scope for expansion and improvement. Pustejovsky (1995), for example, presents a model in which 
each lexical item in a dictionary would be characterized by an argument structure (specifying the 
number and type of arguments that a lexical item carries), an event structure (characterizing the event 
type of a lexical item and its internal structure), qualia structure (representing the different modes of 
predication possible with a lexical item) and a lexical inheritance structure (identifying how a lexical 
structure is related to other structures in the dictionary). Each of these structures could then be used 
to infer a very complex (but also comprehensive) net of relationships between lexical items. Qualia 
structures, for example, would provide information on "constitutive" relationships (the relation 
between an object and its constitutive parts, e.g. material, weight, parts and component elements), 
"formal" relationships (that which distinguishes an object within a larger domain, e.g. orientation, 
magnitude, shape, dimensionality, colour, position), "telic" relationships (the purpose or function of 
an object, e.g. the purpose an agent has in performing an act or the built-in function or aim which
specifies certain activities) and "agentive" relationships (factors involved in the origin or "bringing 
about" of an object, e.g. creator, artifact, natural kind, causal chain). Of the relationships identified by 
Pustejovsky, WordNet partially considers argument structure (only in the case of verbs, as verb 
groups), inheritance structure (hyponym relationships, but not the "complex type" relationships 
identified by Pustejovsky) and qualia structure. In the case of qualia structure it only considers
"constitutive" relationships, in the form of meronym relationships (member, substance and part) and,
in part, "agentive" relationships (in the form of entailment and causal relationships).

There is scope therefore for enhancing WordNet by adding relationships such as argument structure 
for words that are not verbs, event structure, complex inheritance, and qualia structures such as 
formal and telic relationships. 

This study will focus on telic relationships and in particular will examine a method by which telic 
relationships can be automatically discovered from the glosses contained in WordNet itself and used 
to augment WordNet itself. 

2. Telic Relationships 

The qualia structures identified by Pustejovsky are derived in part from the Aristotelian view of word
meaning which identified a set of "modes of explanation" (aitiai) that could be applied to words.
These modes of explanation identify particular aspects of meaning (qualia) which can be used to
connect words in a lexicon. One particular aspect of meaning is the "telic" of an object, indicating the
purpose or function of an object, for example the purpose an agent has in performing an act or the
built-in function or aim which specifies certain activities. Thus the telic of milk would be drink, as
the purpose of milk is to be drunk. Although Pustejovsky never mentions multiple relationships, it is
conceivable that a word may have more than one telic relation, as, for example, wood, used both for
burning and for making furniture.

The objective was therefore to extend WordNet by creating a new set of relations telic(A, B),
linking two synsets and indicating that there exists a telic relationship between A and B, such that if
A is a synset representing a word, B is the telic of A, or, in other words, that A is used to achieve B.

3. Related Work 

Machine readable dictionaries and encyclopaedias have been shown to be useful tools in the creation 
of knowledge-bases. Different approaches have been applied, including pattern-matching (e.g. 
Chodorow et al. 1985), and specially constructed or broad coverage parsers (see for example Wilks et 
al. 1996; Richardson et al. 1998; Kang and Lee 2001; Katz et al. 2001). WordNet glosses, brief 
explanations describing the particular meaning of individual synsets within WordNet, have been 
successfully used to semi-automatically enhance and create knowledge bases. Moldovan and Rus 
(2001), and Harabagiu et al. (1999), for example, parsed the text of the glosses in order to transform 
them into a logical form to be used respectively as axioms in reasoning about world-knowledge and 
to enhance WordNet with new derivational morphology relations. Attempts have also been made to 
automatically build qualia relations from corpora (Pustejovsky et al. 1993; Bouillon et al. 2001). 

4. Method 

In order to use the WordNet glosses to add telic relations to WordNet itself, it was necessary to: 

• Extract the telic relation (or, possibly, relations) from the gloss using some parsing method. The 
result of this process was expected to be a word or a group of words representing the telic 
relation(s) for the synset which provided the gloss. 

• Transform the telic word into a synset by disambiguating its meaning, thus avoiding the creation 
of misleading relationships. 

In the present experiments, pattern-matching, enhanced by a very simple part-of-speech parser, was 
used to find the relevant information, i.e. a word representing the telic of the given synset. The 
extracted information was then passed to a word disambiguation module which returned a synset for 
the given telic word. 

4.1. Identification and Extraction of Telic Words 

Initially the WordNet glosses were cleaned, removing example sentences. The glosses were then 
analyzed for patterns indicating that the gloss contained information about the telic relations for a 
particular synset. It was noted that within the given glosses telic relations could be found in the 
presence of the following patterns: 

"... to TELIC_VERB by the use of...", as in mammography ( synset id 100649306), which has 
as gloss "a diagnostic procedure to detect breast tumors by the use of X rays", indicating that 
the telic relation for mammography is to detect breast tumors (i.e. it is used to detect breast 
tumors). 


"... used for TELIC" , as in tracing_paper (synset id 110816432), with gloss "a 
semitransparent paper that is used for tracing drawings", indicating that the telic relation for 
tracing paper is to trace drawings. 

"... used to TELIC", as in cardiac_glycoside (synset id 110805579), defined as "obtained 
from a number of plants and used to stimulate the heart in cases of heart failure", indicating 
that the telic relation for cardiac glycoside is the stimulation of the heart. 

"... use of... to TELIC" as in trickery (synset id 100485559), with gloss "the use of tricks to 
deceive someone (usually to extract money from them)", indicating that the telic of trickery is 
to deceive someone. 

"... used as ... in TELIC_ING-VERB" as in Plasticine (synset id 110453999), defined as "a 
synthetic material resembling clay but remaining soft; used as a substitute for clay or wax in 
modeling (especially in schools)", indicating that the telic relation for Plasticine is modeling. 

"... used in TELIC_ING-VERB" as in seal_oil (synset id 110781016), with gloss "a pale 
yellow to red-brown fatty oil obtained from seal blubber; used in making soap and dressing 
leather and as a lubricant", indicating that the telic relations for seal_oil should be making 
soap, dressing leather and lubrication. 

"... used in ... as a TELIC" as in giant_taro (synset id 108093257) with gloss "large 
evergreen with extremely large erect or spreading leaves; cultivated widely in tropics for its 
edible rhizome and shoots; used in wet warm regions as a stately ornamental", indicating that 
the telic relation of a giant taro is its use as a stately ornamental. 

"... for use as TELIC" as in houseboat (synset id 102838388), defined in its gloss as "a barge that 
is designed and equipped for use as a dwelling" indicating that the telic relation for a 
houseboat is a dwelling. 

"... for use in ... TELIC_ING-VERB" as in wherry (synset id 103611080), with gloss "light 
rowboat for use in racing or for transporting goods and passengers in inland waters and 
harbors", indicating that the telic relation for a wherry is racing and the transportation of 
goods and passengers. 
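
The cue patterns listed above lend themselves to simple regular expressions. The following is only a minimal sketch: the paper does not give its actual pattern definitions, so the expressions, the names `TELIC_PATTERNS` and `find_telic_phrases`, and the choice to stop a capture at punctuation are all assumptions made for illustration. Only four of the nine patterns are covered.

```python
import re

# Hypothetical regular expressions approximating four of the cue patterns
# described in the text; the original system's patterns are not published here.
TELIC_PATTERNS = [
    ("to VERB by the use of", re.compile(r"\bto (?P<telic>.+?) by the use of\b")),
    ("used for", re.compile(r"\bused for (?P<telic>[^;,()]+)")),
    ("used to", re.compile(r"\bused to (?P<telic>[^;,()]+)")),
    ("for use as", re.compile(r"\bfor use as (?P<telic>[^;,()]+)")),
]


def find_telic_phrases(gloss):
    """Return (pattern name, captured phrase) pairs for every cue found in a gloss."""
    hits = []
    for name, pattern in TELIC_PATTERNS:
        for match in pattern.finditer(gloss):
            hits.append((name, match.group("telic").strip()))
    return hits


if __name__ == "__main__":
    print(find_telic_phrases(
        "a diagnostic procedure to detect breast tumors by the use of X rays"))
    print(find_telic_phrases(
        "a semitransparent paper that is used for tracing drawings"))
```

The captured phrase is still a group of words ("detect breast tumors"); reducing it to a single telic word is a separate step.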

A subset of the WordNet glosses possibly containing information regarding telic relations was 
therefore taken and each gloss was split where more than one telic relation was indicated, as in the 
presence of conjunctions or disjunctions (as in synset 103357011, "a sailing ship [...] used in fishing 
and sailing along the coast", where two telic relations, a) fishing, and b) sailing, are present) and 
semicolons (as in synset 110842812, "any of a group of synthetic steroid hormones used to stimulate 
muscle and bone growth; sometimes used illicitly by athletes to increase their strength" where two 
telic relations are present, a) the stimulation of muscle and bone growth, and b) to increase strength). 
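
The splitting step can be sketched as follows. This is an illustrative assumption about how a captured phrase might be divided, not the procedure actually used; `split_telic_phrase` is a hypothetical helper.

```python
import re


def split_telic_phrase(phrase):
    """Split a captured phrase on semicolons and on the words 'and' / 'or'."""
    parts = re.split(r";|\band\b|\bor\b", phrase)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    print(split_telic_phrase("fishing and sailing along the coast"))
```

Purely lexical splitting of this kind over-splits coordinated nouns (for instance "muscle and bone growth" would be cut in two), which is one reason some part-of-speech information is needed alongside the patterns.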

It was then necessary to identify one word (or compound word) that could summarize the telic 
relationship found in the gloss. In particular, it was necessary to identify one verb or noun that would 
represent the telic for a chosen synset. In order to avoid over-generalization, words such as "be", 
"do", "make" and "thing" were avoided and where these were found, a more specific word was 
sought. So, for example, in seeking the telic relationship for conditioner (synset id 102485262), 
defined as "a substance used in washing (clothing or hair) to make things softer", the relationship that 
was sought was to "make soft", i.e. to soften, not simply to "make", which is far too general to be of 
any use. In these cases a more specific noun or verb was sought and, in the absence of a more 
specific noun or verb, the adjective attached to the noun was modified to find a more specific verb (in 
the case of conditioner, the adjective "soft" was used to derive the verb "soften"). 

4.2. Telic Word Sense Disambiguation 

Having identified one word that summarized the telic relationship, it was necessary to identify the 
specific sense of the telic word, i.e. to find the synset which represented the telic relationship. It was 
therefore necessary to consider the set of possible synsets to which the telic word could belong and 
choose the synset that best represented the meaning of the telic word. One approach to 
disambiguation of word sense is to use some measure of relatedness between the word and its 
context, or, in other words, to calculate the semantic similarity or conceptual distance between the 
word and its context (Miller and Teibel 1991; Rada et al. 1989). WordNet has been shown to be 
fruitful in the calculation of semantic distance, determining similarity by calculating the length of the 
path of relations connecting two concepts; different approaches, using either all WordNet relations 
(Hirst and St-Onge 1998) or only is-a relations (Resnik 1995; Jiang and Conrath 1997; Lin 1998; 
Leacock and Chodorow 1998), have been proposed (for an evaluation see Budanitsky and Hirst 2001). 
In determining conceptual distance (also referred to as conceptual density), Mihalcea and Moldovan 
(1999) and Harabagiu et al. (1999) found WordNet glosses, considered as micro-contexts, to be 
useful. A similarly inspired approach was taken in this study, with the meaning of a telic word being 
constrained by a) the gloss from which it was extracted and b) the set of glosses of the synsets to 
which it could belong. 

Therefore, given a word w, whose meaning was represented by the gloss (definition) GW_w, and its 
telic word t, it was necessary to find the synset ts representing the correct meaning of t from the set 
T of all synsets to which t could belong. The correct synset ts (in other words, the correct sense) for the 
telic word t was taken to be the synset that maximized the semantic distance between GW_w and ts, as 
follows: 

\[ ts = \operatorname{argmax}_{s \in T} \; sd(GW_w, s) \]

where sd(a, s) is a function calculating the semantic distance sd between a sentence a and a 
particular meaning of a word, represented by its synset s. It returns a number between 0 and 1 
indicating the relatedness between the two, where 1 indicates they have the same meaning and 
0 indicates they have no meaning in common. 
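
The selection rule above amounts to picking the candidate with the highest sd score. A minimal sketch follows, using a deliberately simple stand-in for sd (Jaccard overlap of word sets) and made-up toy candidates; `toy_sd`, `choose_sense` and the sense names are hypothetical and are not the measure or data used in the study.

```python
def toy_sd(gloss_words, synset_words):
    """Toy stand-in for sd: Jaccard overlap of two word sets, a value in [0, 1]."""
    union = gloss_words | synset_words
    if not union:
        return 0.0
    return len(gloss_words & synset_words) / len(union)


def choose_sense(gloss_words, candidates, sd=toy_sd):
    """Return the name of the candidate sense whose word set scores highest."""
    return max(candidates, key=lambda name: sd(gloss_words, candidates[name]))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    gloss = {"box", "maintain", "constant", "temperature"}
    senses = {
        "sense_1": {"keep", "constant", "temperature", "state"},
        "sense_2": {"write", "regular", "records"},
    }
    print(choose_sense(gloss, senses))
```

Passing sd as a parameter keeps the argmax step independent of how relatedness is actually computed.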

In order to calculate the semantic distance sd, the set TS_ts was constructed by taking all the words in 
the gloss GT of ts and all the words in the glosses of all the hyponyms and hypernyms of ts (to a 
depth of 3 hypernyms and 3 hyponyms) as follows: 

\[ TS_{ts} = \{ w : w \in GT \;\lor\; w \in hyperg(ts, 3) \;\lor\; w \in hypog(ts, 3) \} \]

where w represents a word; hyperg(s, d) is a function which returns a set made up of all the words in 
the glosses of the hypernyms of a synset s, to a depth d; and hypog(s, d) is a function which returns a 
set made up of all the words in the glosses of the hyponyms of a synset s, to a depth d. 
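
The construction of this word set can be sketched over a small in-memory graph. The dictionaries below are toy data invented for illustration, and `gloss_words_within` and `build_ts` are hypothetical names; a real implementation would read glosses and hypernym/hyponym links from WordNet itself.

```python
def gloss_words_within(synset, neighbours, glosses, depth):
    """Collect gloss words of every synset reachable via `neighbours` in at most `depth` steps."""
    words = set()
    frontier = [synset]
    for _ in range(depth):
        frontier = [n for s in frontier for n in neighbours.get(s, [])]
        for n in frontier:
            words |= set(glosses[n].split())
    return words


def build_ts(synset, glosses, hypernyms, hyponyms, depth=3):
    """Words of the synset's own gloss plus those of its hypernyms and hyponyms."""
    words = set(glosses[synset].split())
    words |= gloss_words_within(synset, hypernyms, glosses, depth)   # hyperg(ts, d)
    words |= gloss_words_within(synset, hyponyms, glosses, depth)    # hypog(ts, d)
    return words


if __name__ == "__main__":
    # Toy glosses and links, not taken from WordNet.
    glosses = {
        "maintain": "keep in a certain state",
        "keep": "cause to continue",
        "preserve": "maintain in safety",
        "conserve": "preserve carefully",
    }
    hypernyms = {"maintain": ["keep"]}
    hyponyms = {"maintain": ["preserve"], "preserve": ["conserve"]}
    print(sorted(build_ts("maintain", glosses, hypernyms, hyponyms, depth=1)))
```

The `depth` argument corresponds to the depth parameter d in the definitions above.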

TS was then compared with GW_w (which was considered the set of words making up a gloss) by 
using a form of term overlap measure to measure their semantic relatedness. Initially, a set of 
stop-words SW was used to ignore words that were too common to be of any use (e.g. "the", "do"), thus 
producing the two reduced sets RGW_w and RTS: 

\[ RGW_w = GW_w - SW \]
\[ RTS = TS - SW \]

The remaining words were then analyzed to find their stems, thus producing the two sets SGW_w and 
STS made of the stems of the words belonging to RGW_w and RTS. 
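
The two reduction steps can be sketched as set operations. The stop-word list and the suffix-stripping stemmer below are placeholders chosen for illustration (the text does not say which list or stemmer was used), and `reduce_and_stem` is a hypothetical name.

```python
# Hypothetical stop-word list; only "the" and "do" are mentioned in the text.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "by", "do", "is", "that"}


def crude_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def reduce_and_stem(words):
    """Remove stop-words (RGW_w = GW_w - SW), then stem what remains."""
    reduced = {w.lower() for w in words} - STOP_WORDS
    return {crude_stem(w) for w in reduced}


if __name__ == "__main__":
    print(sorted(reduce_and_stem(["The", "tracing", "drawings"])))
```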


Each word in RGW_w was then compared to all the words in RTS, using all the available WordNet 
relationships (is_a, satellite, similar, pertains, meronym, entails, etc.), with the additional 
relationship, "same_as", which indicated that two words were identical. Each relationship was given 
a weighting indicating how related two words were, with a "same_as" relationship indicating the 
closest relationship, followed by synonym relationships, hypernym, hyponym, then satellite, 
meronym, pertains, entails. Each word w_i in RGW_w was therefore assigned a weighting r_i indicating 
its relatedness to RTS, and the total semantic distance tsd between RGW_w and RTS was calculated as 
the normalized sum of all the weightings r of RGW_w. The normalization was carried out by dividing 
the number of words in the gloss by the sum of the weightings plus 1, in order for short glosses not to be 
disadvantaged: 

\[ tsd = \frac{|RTS|}{\left( \sum_{w \in RGW_w} r(w, RTS) \right) + 1} \]
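
The per-word weighting can be sketched as a lookup over relation types. Only the ordering of the relations comes from the text; the numeric weights, the rule of taking the strongest relation found for each word, and the names `WEIGHTS`, `word_weighting` and `sum_of_weightings` are assumptions. The sketch stops at the raw sum of weightings and leaves the normalization to the formula above.

```python
# Hypothetical weights: the text gives the ordering of relations but no values.
WEIGHTS = {
    "same_as": 1.0,
    "synonym": 0.9,
    "hypernym": 0.8,
    "hyponym": 0.7,
    "satellite": 0.6,
    "meronym": 0.5,
    "pertains": 0.4,
    "entails": 0.3,
}


def word_weighting(word, rts, relations):
    """Weight of the strongest relation between `word` and any word in `rts` (0.0 if none)."""
    best = 0.0
    for other in rts:
        relation = "same_as" if word == other else relations.get((word, other))
        if relation is not None:
            best = max(best, WEIGHTS[relation])
    return best


def sum_of_weightings(rgw, rts, relations):
    """Sum of the weightings r(w, RTS) over all words w in the reduced gloss."""
    return sum(word_weighting(w, rts, relations) for w in rgw)


if __name__ == "__main__":
    # Toy relation table, invented for illustration.
    relations = {("maintain", "keep"): "synonym"}
    rgw = {"maintain", "temperature", "box"}
    rts = {"keep", "temperature", "state"}
    print(sum_of_weightings(rgw, rts, relations))
```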

A number of experiments were conducted to see to what depth the hyper- and hyponyms of a 
candidate synset should be considered, i.e. to decide, given a synset S and its hyponyms HS, whether the 
hypernyms HS' of the set HS should also be considered, whether the hypernyms HS" of HS' should be considered, 
and so forth to an arbitrary depth n. It was found that a depth of 3 hypernyms and 3 hyponyms gave 
satisfactory results in an acceptable time; however, further research would be necessary to optimize 
this parameter. 

5. Results 

2449 telic relationships were derived, relating to 1841 different synsets (i.e. a synset could have more 
than one telic relationship). A sample (about 10% of the total) of the derived relationships was 
examined manually and it was estimated that 78% of the relationships were actually telic 
relationships, while the rest either denoted other types of relationships (e.g. the context in which 
something is used), or denoted telic relationships that could not be summarized in one word (as in 
synset 110556533, "activating agent", whose telic should be "increase the attraction to a specific 
mineral"). 9% of the relationships were in effect telic relationships, but were counterintuitive, in part 
because of the limitations posed by the adopted method, which considered telics in isolation from 
their possible objects (e.g. a cancer drug having as telic "kill", because its function is to kill cancer 
cells). Of the correct relationships, the correct synset (i.e. the correct meaning) was chosen 77% of 
the time. The disambiguation algorithm failed mainly where there were very subtle differences in 
meaning in WordNet, as in the difference in meaning for the word "represent" between synset 
201841374 (take the place of), 200566766 (express indirectly; be a symbol of) and 200265192 (to 
establish a mapping (of mathematical elements or sets)), or the difference in meaning for the word 
"stain", between synset 200196870 (produce or leave stains; "Red wine stains table cloths") and 
201053918 (make a spot or mark onto; "The wine spotted the tablecloth"). 


Total relationships found          2449 
Number of different synsets        1841 
Actual telic relationships         78% (1910) 
Of which with correct synset       77% (1470) 


A number of useful telic relationships were derived: 

Example 1 : from synset 102853717, indicating "incubator, brooder", defined as "a box designed to 
maintain a constant temperature by the use of a thermostat; used for chicks or premature infants" it 
was derived that incubators are used for (or: the telic of an incubator is) maintaining a constant 
temperature: the telic word was therefore correctly identified as "maintain", with the particular 
meaning given by synset 201829600, i.e. "keep in a certain state, position, or activity", which was 
correctly chosen by the algorithm in preference to, for example, synset 200723279 (maintain by 
writing regular records), synset 200607420 (support against an opponent) and 200496801 (observe 
correctly). 


Example 2: from synset 100450328, indicating "desensitization technique, desensitization procedure, 
systematic desensitization", defined as "a technique used in behavior therapy to treat phobias and 
other behavior problems involving anxiety; client is exposed to the threatening situation under 
relaxed conditions until the anxiety reaction is extinguished", the algorithm correctly inferred that the 
telic of desensitization technique was "treat" in the particular meaning given by synset 200054862, 
i.e. "provide treatment for" as in "The doctor treated my broken leg"; this meaning was chosen in 
favour of incorrect alternatives such as 201547305 (provide with a treat) and 200699711 (deal with 
verbally or in some form of artistic expression). 

Example 3: from synset 102399372, indicating "cash_register, register", defined as "a cashbox with an adding 
machine to register transactions; used in shops to add up the bill", it was correctly inferred that the 
telic was 201805970, "add up", with the meaning "add up in number or quantity", and not "add up" 
as in synset 201786912, "be reasonable or logical or comprehensible" or in synset 201792159, 
"develop into". 

6. Conclusions and further work 

WordNet contains a significant amount of implicit information in the form of synset glosses. This 
study has shown how this implicit information can be made explicit by automatically extracting new 
relationships. In particular, the algorithm presented usefully extended WordNet by automatically 
inducing a number of telic relationships from the glosses. 

Future work will involve a manual review of the relationships found to ensure that the derived 
relationships are of sufficiently high quality to be used in practice. It will also be necessary to address 
the limitations of the method of representation chosen, which constrained telic relationships to be 
between two words as opposed to relationships between a word and a sentence: in a number of 
instances a single word was not enough to represent a telic relationship. Another problem that needs 
to be tackled is the fine granularity of meaning in WordNet, which in some cases made it very 
difficult to choose a particular synset (meaning) for a telic relationship. Another direction for future 
work will be the application of the proposed methodology to derive telic relationships from other 
machine readable dictionaries. A further interesting direction for research would be exploring the 
possibility of moving on from machine readable dictionaries in order to derive telic relationships 
from generic corpora. 

References 

Bouillon, P., Claveau, V., Fabre, C., Sebillot, P. (2001) "Using part-of-speech and semantic tagging 
for the corpus-based learning of qualia", in Bouillon and Kanzaki (eds.), Proceedings of the First 
International Workshop on Generative Approaches to the Lexicon, Geneva. 

Budanitsky, A., and Hirst, G. (2001) "Semantic distance in WordNet: an experimental, 
application-oriented evaluation of five measures", in Proceedings of the NAACL 2001 Workshop on WordNet 
and other lexical resources, Pittsburgh. 

Chodorow, M., Byrd, R., Heidorn, G. (1985) "Extracting semantic hierarchies from a large on-line 
dictionary", in Proceedings of the 23rd Annual Meeting of ACL. 

Fellbaum, C. (1998) "WordNet: An Electronic Lexical Database", MIT Press. 

Harabagiu, S. A., Miller, A. G., Moldovan, D. I. (1999) "WordNet2 - a morphologically and 
semantically enhanced resource", in Proceedings of SIGLEX-99, University of Maryland. 


Hirst, G., and St-Onge, D. (1998) "Lexical chains as representations of context for the detection and 
correction of malapropisms", in Fellbaum (ed.), "WordNet: an electronic lexical database", MIT 
Press. 

Jiang, J. J., and Conrath, D. W. (1997) "Semantic similarity based on corpus statistics and lexical 
taxonomy", in Proceedings of ICRCL, Taiwan. 

Kang, S-J, Lee, J-H. (2001) "Semi-Automatic Practical Ontology Construction by Using a 
Thesaurus, Computational Dictionaries and Large Corpora", in Proceedings of the Workshop on 
Human Language Technology, ACL-2001, Toulouse. 

Katz, B., Lin, J., Felshin, S. (2001) "Gathering Knowledge for a Question Answering System from 
Heterogeneous Information Sources", in Proceedings of the Workshop on Human Language 
Technology, ACL-2001, Toulouse. 

Lin, D. (1998) "An information-theoretic definition of similarity", in Proceedings of the 15th 
International Conference on Machine Learning, Madison. 

Miller, G., and Teibel, D. (1991) "A proposal for lexical disambiguation", in Proceedings of DARPA 
Speech and Natural Language Workshop, California. 

Miller, G. A. (1995) "WordNet: A Lexical Database", Communications of the ACM, 38 (11). 

Moldovan, D., and Rus, V. (2001) "Logic Form Transformation of WordNet and its Applicability to 
Question Answering", in Proceedings of ACL-2001, Toulouse. 

Pustejovsky, J., Bergler, S., Anick, P. (1993) "Lexical Semantic techniques for corpus analysis", 
Computational Linguistics, 19 (2). 

Pustejovsky, J. (1995) "The Generative Lexicon", MIT Press. 

Rada, R., Mili, H., Bicknell, E. and Blettner, M. (1989) "Development and application of a metric on 
semantic nets", in IEEE Transactions on Systems, Man and Cybernetics, vol. 19, n. 1. 

Mihalcea, R., and Moldovan, D. (1999) "A Method for Word Sense Disambiguation of 
Unrestricted Text", in Proceedings of ACL '99, Maryland, NY. 

Resnik, P. (1995) "Using information content to evaluate semantic similarity", in Proceedings of the 
14th IJCAI, Montreal. 

Richardson, S. D., Dolan, W. D., Vanderwende, L. (1998) "Mindnet: acquiring and structuring 
semantic information from text", in Proceedings of COLING-98. 

Wilks, Y. A., Slator, B. M., Guthrie, L. M. (1996) "Electric Words: dictionaries, computers and 
meaning", MIT Press. 


Marco De Boni, Department of Computer Science, University of York, York YO10 5DD, United Kingdom, 
mdeboni@cs.york.ac.uk 

Suresh Manandhar, Department of Computer Science, University of York, York YO10 5DD, United Kingdom, 
suresh@cs.york.ac.uk 


Nouns in WordNet and HowNet: 
an Analysis and Comparison of Semantic Relations 


Ping Wai Wong 
Pascale Fung 


ABSTRACT 

This paper compares and analyses the semantic relations of nouns in WordNet and HowNet. Their 
main difference lies in the theory of meaning representation. WordNet, adopting the differential 
approach, uses synsets to differentiate one concept from the other. HowNet, using the constructive 
approach, uses sememes (the basic unit of meaning) to build up the meaning of a concept. 
Regardless of the difference in meaning representation, the semantic relations of nouns are quite 
similar in WordNet and HowNet. For example, hyponymy, synonymy, antonymy, meronymy and 
value-attribute relations are represented in both. The differences will be described in detail in the 
paper. 

Introduction 

WordNet (Miller, 1990; Miller and Fellbaum, 1991; Fellbaum, 1998) is an on-line lexical database in 
which English nouns, verbs, adjectives and adverbs are organized in terms of semantic relations such as 
synonymy, antonymy, hyponymy and meronymy. Such a lexical system was lacking in Chinese until 
the release of HowNet in 1999. HowNet (Dong, 1988) is not just a Chinese version of WordNet. It has 
its own structure in describing inter-concept relations and inter-attribute relations of concepts. The 
encoded semantic relations of WordNet and HowNet may well complement each other, and it is possible 
to enrich HowNet by aligning it with WordNet and taking advantage of the glossed lexicon and 
ontology of HowNet (Carpuat et al. 2002). The alignment task, however, may be complicated by the 
different approaches taken in constructing WordNet and HowNet. Thus, this paper aims to give a 
preliminary analysis and comparison of the semantic relations encoded in HowNet and WordNet. 

The latest version of HowNet covers over 65,000 concepts in Chinese and close to 75,000 English 
equivalents. Concepts are defined by a set of 1503 sememes, which are organized in an ontological tree. 
Its top-most level is: Entity, Event, Attribute and Attribute Value. In general, adjectives belong to the 
class of Attribute Value, and verbs belong to the class of Event. A noun in Chinese always falls into the 
classes of Entity or Attribute in HowNet, though there are a few exceptions in English. (For example, the 
nominalization of an adjective, like “happiness” in English, is categorized as Attribute Value.) Nouns 
express nominal concepts that are used to name a person, animal, place, thing, or abstract idea. They are 
the majority of concepts in both WordNet and HowNet and involve a large variety of semantic 
relations, and thus are our focus in this paper. 

WordNet and HowNet share a similar idea in the definition of nouns though they differ in how meaning is 
represented. As mentioned in Miller (1993), the definition of a common noun typically consists of (i) its 
immediate superordinate term and (ii) some distinguishing features. These two components are used in 
the definition of nouns in WordNet and HowNet. Superordinate terms (hypernyms) are organized in a 
hierarchical structure, in which the subordinates (hyponyms) inherit the distinguishing features of the 
superordinates (Miller, 1993). The hypernym gives a general classification of a concept and the 
distinguishing features provide more specific information to distinguish one concept from another 
(examples are given in Sections 1 and 2.2). The distinguishing features discussed here are such 
semantic relations as meronymy, value-attribute and event role (“functions” as discussed in Miller, 
1993), the details of which will be illustrated in Section 2. 

1 Difference in meaning representation 

HowNet differs from WordNet in meaning representation. As mentioned in Miller et al (1993), 
meaning representation is either constructive or differential. HowNet uses the former representation 
whereas WordNet uses the latter. WordNet, adopting the differential approach, relies on a device that 
enables one to differentiate one concept from another. Perhaps synonyms plus some explanatory 
glosses can be a sufficient means to this end. Take as an example the two senses represented by the word 
form end: it can mean either “the concluding part of an event” or “the goal intended to be achieved”, 
which are represented by the two synsets {end, last} and {end, goal} in WordNet. The synonyms “last” 
and “goal” help us to distinguish between these two senses. Concepts are represented by 
synsets in WordNet, but by a combination of sememes and pointers in HowNet. Take the 
concept “teacher” as an example: 

(1) Meaning representation in WordNet - synset: 

{teacher, instructor} - (a person whose occupation is teaching) 

(2) Meaning representation in HowNet - combination of sememes and pointers: 

Concept: (teacher) DEF=human|人, *teach|教, education|教育 

WordNet uses the synset {teacher, instructor} to represent the concept “teacher”. HowNet decomposes 
this concept into the sememes ‘human|人’, ‘teach|教’ and ‘education|教育’, and uses the pointer “*” to 
express the semantic relation between the concept “teacher” and the event ‘teach|教’. The sememe 
appearing in the first position of “DEF”, for example ‘human|人’, is the categorical attribute, which 
names the hypernym of the concept “teacher”. Those sememes appearing in other positions, for 
example ‘teach|教’ and ‘education|教育’, are additional attributes, which give more specific 
information about the concept. The sememe without a pointer, ‘education|教育’, is the specific attribute 
value of the concept “teacher”. The one with the pointer “*” represents an event role relation, as 
illustrated in 2.2, which states that the function of a teacher is to be the agent of ‘teach’. In HowNet, a pointer 
expresses a relation between a concept (e.g. “teacher”) and a sememe (e.g. ‘teach|教’). The semantic 
relations between concepts are linked by the pointers and the sememes encoded in concept definitions 
(e.g. the relations of “teacher” and “pupil” to be illustrated in section 2.3). The case is simpler in 
WordNet, in which a pointer states explicitly the relation between one concept/synset and the other. 
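
The structure of a DEF string described above (a categorical attribute in first position, followed by additional attributes that may carry a pointer) can be made concrete with a small parser. This is an illustrative sketch only: `parse_def` and `categorical_attribute` are hypothetical helpers, the set of pointer symbols is limited to those appearing in this paper's examples ("*", "?", "&", "#"), and the DEF string in the usage lines is the "teacher" definition as understood from example (2).

```python
def parse_def(definition):
    """Split a HowNet-style DEF string into (pointer, english, chinese) triples."""
    items = []
    for raw in definition.split(","):
        item = raw.strip()
        pointer = ""
        # A leading symbol marks a relation between the concept and the sememe.
        if item and item[0] in "*?&#":
            pointer, item = item[0], item[1:]
        english, _, chinese = item.partition("|")
        items.append((pointer, english, chinese))
    return items


def categorical_attribute(definition):
    """The first sememe in DEF names the hypernym of the concept."""
    return parse_def(definition)[0][1]


if __name__ == "__main__":
    teacher = "human|人, *teach|教, education|教育"
    print(parse_def(teacher))
    print(categorical_attribute(teacher))
```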

2 Semantic relations 

Nouns in WordNet are coded with semantic relations such as synonymy, antonymy, hyponymy, 
meronymy and value-attribute relations. Nouns in HowNet involve ten semantic relations, namely, 
hyponymy, synonymy, antonymy, converse, part-whole, material-product, attribute-host, 
value-attribute, concept co-occurrence and event role relations. 

2.1 Synonymy 

Synonyms in WordNet are grouped into synsets and an example is shown in (3). In HowNet, synonyms 
are expressed by identical definitions; for example, the Chinese synonyms for “clothes” all have the 
same definition, as shown in (4). 

(3) Synset in WordNet: {clothing, clothes, apparel, vesture, wearing apparel, wear} 

(4) Synonyms in HowNet (the definitions are identical), for example: 

Concepts: (clothing) DEF=clothing|衣物, generic|统称 

2.2 Hyponymy/Hypernymy and event role relation 

Both hypernyms and hyponyms are coded in WordNet whereas only hypernyms are coded in HowNet. 
They both have “Entity” as the top node of the hierarchy but the hierarchy is much more detailed in 
WordNet. For example, in HowNet, both “teacher” and “researcher” have ‘human|人’ 
as the hypernym, as shown in the categorical attribute of (2) and (5). Their differences are indicated by 
additional attributes - their functions as revealed by event role relations: “teacher” plays the agent role 
of ‘teach’ (as denoted by *teach|教) whereas “researcher” plays the agent role of ‘research’ (as stated by 
*research|研究). ‘teach|教’ and ‘research|研究’ are events denoted by verbs that are also organized 
hierarchically in HowNet. 

(2) Concept: (teacher) DEF=human|人, *teach|教, education|教育 

(5) Concept: (researcher) DEF=human|人, *research|研究 

The linkage of nominal concepts with verbal concepts helps to define nominal concepts, and such 
linkage is called event role relation in HowNet. In WordNet, functions of a noun are described in the 
explanatory gloss but not coded (Miller 1993). The difference between the concepts of “teacher” and 
“researcher” is reflected by the hierarchies of nouns only, as shown below: 

(6) The hypernyms of “teacher” and “researcher” in WordNet 

teacher -> educator -> professional -> adult -> person -> life form -> entity 
researcher -> scientist -> intellectual -> person -> life form -> entity 
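
How the two chains in (6) separate the concepts can be shown by finding the first hypernym they share; everything below that node is what distinguishes one noun from the other. The sketch below treats each chain as a plain list, and `lowest_shared_hypernym` and `distinguishing_part` are hypothetical helper names.

```python
TEACHER = ["teacher", "educator", "professional", "adult",
           "person", "life form", "entity"]
RESEARCHER = ["researcher", "scientist", "intellectual",
              "person", "life form", "entity"]


def lowest_shared_hypernym(chain_a, chain_b):
    """Return the first node of chain_a that also occurs in chain_b, or None."""
    in_b = set(chain_b)
    for node in chain_a:
        if node in in_b:
            return node
    return None


def distinguishing_part(chain, shared):
    """Nodes of a chain that lie below the shared hypernym."""
    return chain[: chain.index(shared)]


if __name__ == "__main__":
    shared = lowest_shared_hypernym(TEACHER, RESEARCHER)
    print(shared)
    print(distinguishing_part(TEACHER, shared))
    print(distinguishing_part(RESEARCHER, shared))
```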

2.3 Antonymy and converse relation 

Antonymy is rather poorly represented in nouns in WordNet (Vider, 1996), compared with synonymy 
and hyponymy. The same goes for HowNet. In HowNet, antonymy is limited to the class of Attribute 
Value while the class of Event may have converse opposites. Nouns in Chinese seldom fall in the 
category of Attribute Value, but a few nouns in English do, such as happiness/sadness, which derive 
from antonymous adjectives (Miller, 1993). Antonymy in WordNet covers the three types of opposites - 
complementary antonymy (e.g. alive/dead), gradable antonymy (e.g. fast/slow) and converse opposites 
(e.g. give/receive). HowNet differs from WordNet in that converse and complementary antonymy are 
grouped into converse, whereas gradable antonymy is named antonymy. The following shows an 
example of HowNet’s converse opposites - teacher/pupil. They are antonyms that are explicitly coded 
in WordNet, but there is no explicit relation in HowNet. 

(7) Concept: (teacher) DEF=human|人, *teach|教, education|教育 

(8) Concept: (pupil) DEF=human|人, *study|学, education|教育 

As shown in the concept definitions above, ‘*teach|教’ indicates that “teacher” is the agent of the event 
“teach” while ‘*study|学’ denotes that “pupil” is the agent of the event “study” (the verbs learn and 
study denote the event “study” in HowNet). There is a converse relation between the events “teach” 
and “study”, thus linking the two participants of the events - teacher and pupil. In WordNet, 
semantic relations are explicitly coded between the nouns pupil and teacher (antonymy) and between 
the verbs learn and teach (teach causes learn). 

2.4 Meronymy/Holonymy 

There are three types of meronymy in WordNet - part-whole, material-product, and member-collection 
relations. HowNet includes part-whole and material-product relations but does not cover 
member-collection relation. 

2.4.1 Part-whole relation 

WordNet has a more detailed analysis of part-whole relation, compared with HowNet. For example, 
“computer” has a lot of parts such as keyboard, monitor, CPU, cache, etc., and those parts still have 
smaller parts. HowNet only codes one level in part-whole relation, so the parts of “computer”, 
like keyboard, monitor and CPU, are not further divided. The functions of parts in a whole are analogous 
to the human body in HowNet. For example, the concepts glossed as “river head” and “mouth of the river” are 
analogous to such parts in human body as the head and the mouth. 

2.4.2 Material-product relation and member-collection relation 

Material-product relation or stuff-object relation is coded in WordNet and HowNet. For example, in (9), 
the pointer “?” states that ‘clothing|衣物’ is the product made of the material “yarn”. 

(9) Concept: (yarn) DEF=material|材料, ?clothing|衣物 

WordNet includes member-collection relation, for example, “tree” is coded as a member of “forest”. In 
HowNet, “tree” and “forest” have the same hypernym - ‘tree|树’ and there is no device to express their 
member-collection relation. 

2.5 Value-attribute relation and attribute-host relation 

Values of attributes are expressed by adjectives. Adjectives like red are the attribute value of color. 
Color is an attribute of physical things (for instance, flowers and birds), as defined by ‘&physical|物质’ 
in HowNet. Bird is a hyponym of ‘physical|物质’ and thus inherits this distinguishing feature. 



            Attribute Value                        Attribute 
Concept     (red)                                  (color)                                       (bird) 
DEF         aValue|属性值, color|颜色, red|红       attribute|属性, color|颜色, &physical|物质     bird|禽 









The adjective red can modify the noun bird because it has the corresponding attribute color with red as 
a value (Fellbaum et al., 1993). Value-attribute relation (for instance, red/color) is coded in HowNet and 
WordNet. Attribute-host relation is coded in HowNet (with the pointer &) but not in WordNet. 
Pertainym (for instance, former/latter) is a type of relational adjective in WordNet. It points to a noun 
but does not refer to an attribute value (Fellbaum et al., 1993). HowNet does not have this relation since 
there is not a direct semantic relation between modifier (e.g. red) and modified (e.g. bird). 

2.6 Concept co-occurrence 

Concept co-occurrence relation is not captured by WordNet but is coded in HowNet. In the following 
example, the typical context of the concept “scar” is associated with the event ‘wounded|受伤’. This 
co-occurrence relation is expressed by the pointer “#” in HowNet: 

(10) Concept: (scar) DEF=trace|痕迹, #wounded|受伤 

The semantic relations of nouns covered in WordNet and HowNet are similar regardless of their 
different approaches to constructing word meaning. The semantic relations in common are synonymy, 
antonymy, hyponymy, part-whole, material-product and value-attribute relations. When the linkage 
between HowNet and WordNet is established, there are a few complementary semantic relations. For 
example, WordNet’s pertainymy and member-collection relations can be used to enhance HowNet, 
while HowNet’s attribute-host, concept co-occurrence and event role relations can be used to enhance 
WordNet. In this paper, we have presented a preliminary analysis of the semantic relations of nouns in 
HowNet and WordNet. We plan to extend the analysis to other syntactic categories such as verb and 
adjective in future, investigating the semantic linkage between 

especially rich in the linkage of nominal concept with verbal concept (for example, “event role relation” 
discussed in 2.2, “converse relation” in 2.3 and “co-occurrence relation” in 2.6) and we believe it 
can shed light on the future development of semantic linkage in WordNet. 

We are grateful to Dr. Kok-wee Gan, Dr. Endong Xun and Dr. Grace Ngai for many helpful discussions 
and comments. 

Carpuat, M., G. Ngai, P. Fung and K. Church (2002). Creating a Bilingual Lexicon: A Corpus-based Approach 
for Aligning WordNet and HowNet. To appear in the 1st International Conference on Global WordNet, 
21st-25th, January 2002, Mysore, India. 

S&™ * —° f “ nal 

Fellbaum, C. (ed.) (1998) WordNet: An electronic lexical database. Cambridge, Massachusetts: MIT Press. 
Fellbaum, Christiane, Derek Gross and Katherine Miller (1993) Adjectives in WordNet. Five Papers on WordNet, 

M r°Geo n “(19in: Special Issue of Internationa. Journal of 

^ *“ em - Five Papers °" WordNet ’ CSL 

Science Laboratory, Princeton University. _ 

Palmer, F. R. (1993) Semantics. Cambridge University Press. 





Integrating Selectional Preferences in WordNet 


Eneko Agirre 
David Martinez 


Abstract 

Selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb 
selects as its subject a given nominal class. This paper extends previous statistical models to class-to- 
class preferences, and presents a model that learns selectional preferences for classes of verbs, 
together with an algorithm to integrate the learned preferences in WordNet. The theoretical 
motivation is twofold: different senses of a verb may have different preferences, and classes of verbs 
may share preferences. On the practical side, class-to-class selectional preferences can be learned 
from untagged corpora (the same as word-to-class), they provide selectional preferences for less 
frequent word senses via inheritance, and more important, they allow for easy integration in WordNet. 
The model is trained on subject-verb and object-verb relationships extracted from a small corpus 
disambiguated with WordNet senses. Examples are provided illustrating that the theoretical 
motivations are well founded, and showing that the approach is feasible. Experimental results on a 
word sense disambiguation task are also provided. 

1. Introduction 

Previous literature on selectional preferences has usually learned preferences for verbs in the form of 
classes, e.g., the object of eat is an edible entity. This paper extends previous statistical models to 
classes of verbs, yielding a relation between classes in a hierarchy, as opposed to a relation between a 
word and a class. 

The model is trained using subject-verb and object-verb associations extracted from Semcor, a corpus 
(Miller et al., 1993) tagged with WordNet word-senses (Miller et al., 1990), comprising around 
250,000 words. The syntactic relations were extracted using the Minipar parser (Lin, 1993). A 
peculiarity of this exercise is the use of a sense-disambiguated corpus, in contrast to using a large 
corpus of ambiguous words. This corpus makes it easier to compare the selectional preferences 
obtained by different methods. Nevertheless, the approach can be easily applied to larger, non- 
disambiguated corpora. 

This paper argues that class-to-class selectional preferences are a better formalization than verb-to- 
class models. An algorithm to integrate the acquired selectional preferences in WordNet as relations 
holding between synsets is provided. Some examples are described, as well as the results in a word 
sense disambiguation (WSD) exercise. 

Following this short introduction, section 2 reviews selectional preference acquisition literature. 
Section 3 explains our approach, and the estimation of class frequencies is described in Section 4. 
Section 5 presents the algorithm for the integration in WordNet. Section 6 comments on some examples 
of the acquired selectional preferences. Section 7 shows the results on the WSD experiment. Finally, 
some conclusions are drawn and future work is outlined. 

2. Selectional Preference Learning 

Selectional preferences try to capture the fact that linguistic elements prefer arguments of a certain 
semantic class, e.g. a verb like ‘eat’ prefers as object edible things, and as subject animate entities, as 
in (1) “She was eating an apple”. Selectional preferences get more complex than it might seem: (2) 
“The acid ate the metal”, (3) “This car eats a lot of gas”, (4) “We ate our savings”, etc. 

Corpus-based approaches for selectional preference learning extract a number of (e.g. verb/subject) 
relations from large corpora and use an algorithm to generalize from the set of nouns for each verb 
separately. Usually, nouns are generalized using classes (concepts) from a lexical knowledge base 
(e.g. WordNet). 

Resnik (1992, 1997) defines an information-theoretic measure of the association between a verb and 
nominal WordNet classes: selectional association. He uses verb-argument pairs from the Brown 
corpus. Evaluation is performed applying intuition and WSD. Our measure follows in part from his 
formalization. 

Abe and Li (1996) follow a similar approach, but they employ a different information-theoretic 
measure (the minimum description length principle) to select the set of concepts in a hierarchy that 
generalize best the selectional preferences for a verb. The argument pairs are extracted from the WSJ 
corpus, and evaluation is performed using intuition and PP-attachment resolution. 

Stetina et al. (1998) extract word-arg-word triples for all possible combinations, and use a measure of 
“relational probability” based on frequency and similarity. They provide an algorithm to disambiguate 
all words in a sentence. It is directly applied to WSD with good results. 

3. A New Approach that allows Integration in WordNet 

First, we will introduce the terminology used in the paper. We use concept and class 
interchangeably, and they refer to the so-called synsets in WordNet. Concepts in WordNet are 
represented as sets of synonyms, e.g. <food, nutrient>. A word sense in WordNet is a word-concept 
pairing, e.g. given the concepts a=<chicken, poulet, volaille> and b=<wimp, chicken, crybaby> we 
can say that chicken has two word senses, the pair chicken-a and the pair chicken-b. In fact the former 
is sense 1 of chicken (<chicken_1>), and the latter is sense 3 of chicken (<chicken_3>). For the sake of 
simplicity, we also say that <chicken, poulet, volaille> is a word sense of chicken. When a concept is 
taken as a class, it represents the set of concepts that are subsumed by this concept in the hierarchy. 

Traditionally, selectional preferences have been acquired for verbs and they do not take into account 
that different senses of the verbs have different preferences. Therefore, they are usually difficult to 
integrate in existing lexical resources such as WordNet. We have extended Resnik’s selectional preferences 
model from word-to-class (e.g. verb - nominal concepts) to class-to-class (e.g. verbal concepts - 
nominal concepts). The model explored in this paper emerges as a result of the following 
observations: 

• Distinguishing verb senses can be useful. The examples for eat above are taken from WordNet, 
and each corresponds to a different word sense: example (1) is from the “take in solid food” sense 
of eat, (2) from the “cause to rust” sense, and examples (3) and (4) from the “use up” sense. 

• If the word senses of a set of verbs are similar (e.g. word senses of ingestion verbs like eat, 
devour, ingest, etc.) they can have related selectional preferences, and we can generalize and say 
that a class of verbs shares the same selectional preference. 

Our formalization distinguishes among verb senses; that is, we treat each verb sense as a different unit 
that has a particular selectional preference. From the selectional preferences of single verb senses, we 
also infer selectional preferences for classes of verbs. For that, we use the relation between word 
senses and classes in WordNet. 

3.1. Formalization 

As mentioned in the previous sections, we are interested in modelling the probability of a nominal 
concept given that it is the subject/object of a particular verb (1) or verbal concept⁷ (2): 

    P(cn_i | rel v)       (1) 
    P(cn_i | rel cv_j)    (2) 

⁷ Notation: v stands for a verb, cn (cv) stands for a nominal (verbal) concept, cn_i (cv_i) stands for the concept linked to the i-th 
sense of the given noun (verb), rel could be any grammatical relation (in our case object or subject), ⊆ stands for the 
subsumption relation, fr stands for frequency and \hat{fr} for the estimation of the frequencies of classes. 

We will now explain the three models we have tested: word-to-class, word sense-to-class (from now 
on referred to as sense-to-class), and class-to-class. As an example, we will describe the probability of the 
nominal concept <chicken_1> occurring as object of the verb eat. Examples of each of the models are 
provided in section 6 (cf. Table 3). 

Word-to-class model: P(cn_i | rel v) 

The probability of <chicken_1> as object of eat depends on the probabilities of the concepts subsumed by and 
subsuming <chicken_1> being objects of eat. For instance, if chicken_1 never appears as an object of eat, 
but other word senses under <food, nutrient> do, the probability of the concept <chicken_1> (first sense 
of chicken) will not be 0. 

Formula (3) shows that, for all concepts subsuming cn_i, the probability of cn_i given the more general 
concept times the probability of the more general concept being a subject/object of the verb is added. 
The first probability is estimated by dividing the class frequency of cn_i by the class frequency of the 
more general concept. The second probability is estimated by dividing the frequency of the general 
concept occurring as object of eat by the number of occurrences of eat with an object. 

Sense-to-class model: P(cn_i | rel v_j) 

Using a sense-tagged corpus, such as Semcor, we can compute the probability of the different senses 
of eat having as object the class <chicken_1>. For each sense, we use the probability formula (3) as 
defined in 3.1.1. In this case we have different selectional preferences for each sense of the verb: 
P(cn_i | rel v_j). 

Class-to-class model: P(cn_i | rel cv_j) 

We compute the probability of the verb classes associated to the senses of eat having as object 
<chicken_1>, using the probabilities of all concepts above <chicken_1> being objects of all concepts 
above the possible senses of eat. For instance, if devour never appeared in the training corpus, the 
model could infer its selectional preference from that of its superclass <ingest, take in>. 

Formula (4) shows how to calculate the probability. For each possible verb concept (cv) and noun 
concept (cn) subsuming the target concepts (cn_i, cv_j), the probability of the target concept given the 
subsuming concept (this is done twice, once for the verb, once for the noun) times the probability of the 
nominal concept being subject/object of the verbal concept is added. 

Benefits of the Approach 

The main benefits of our approach are the following: 

• Class-to-class preferences can be trained using untagged corpora. 

• In the case of sparse data the model can provide selectional preferences for word senses of verbs 
that do not occur in the corpus. 

• Class-to-class selectional preferences can be easily integrated in WordNet. 

• Distinguishes verb senses. 

• Generalizes selectional preferences for classes of verbs. 





We can keep probabilities for all possible (verbal class, nominal class) pairs, but an algorithm for 
pruning is also provided (cf. Section 5). Table 1 summarizes the benefits of using class-to-class 
selectional preferences. 


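\[
P(cn_i \mid rel\ v) = \sum_{cn \supseteq cn_i} P(cn_i \mid cn) \times P(cn \mid rel\ v)
= \sum_{cn \supseteq cn_i} \frac{\hat{fr}(cn_i, cn)}{\hat{fr}(cn)} \times \frac{\hat{fr}(cn\ rel\ v)}{fr(rel\ v)} \tag{3}
\]

\[
P(cn_i \mid rel\ cv_j) = \sum_{cn \supseteq cn_i} \sum_{cv \supseteq cv_j} P(cn_i \mid cn) \times P(cv_j \mid cv) \times P(cn \mid rel\ cv)
= \sum_{cn \supseteq cn_i} \sum_{cv \supseteq cv_j} \frac{\hat{fr}(cn_i, cn)}{\hat{fr}(cn)} \times \frac{\hat{fr}(cv_j, cv)}{\hat{fr}(cv)} \times \frac{\hat{fr}(cn\ rel\ cv)}{fr(rel\ cv)} \tag{4}
\]

\[
\hat{fr}(cn) = \sum_{cn_i \subseteq cn} \frac{1}{classes(cn_i)} \times fr(cn_i) \tag{5}
\]

\[
\hat{fr}(cn_i, cn) =
\begin{cases}
\displaystyle \sum_{cn_j \subseteq cn_i} \frac{1}{classes(cn_j)} \times fr(cn_j) & \text{if } cn_i \subseteq cn \\
0 & \text{otherwise}
\end{cases} \tag{6}
\]

\[
\hat{fr}(cn\ rel\ v) = \sum_{cn_i \subseteq cn} \frac{1}{classes(cn_i)} \times fr(cn_i\ rel\ v) \tag{7}
\]

\[
\hat{fr}(cn\ rel\ cv) = \sum_{cn_i \subseteq cn} \sum_{cv_j \subseteq cv} \frac{1}{classes(cn_i)} \times \frac{1}{classes(cv_j)} \times fr(cn_i\ rel\ cv_j) \tag{8}
\]
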



                  Learnable from:         Apt for          Distinguishes    Generalizes to 
                  tagged     untagged     integration      verb senses      classes of verbs 
                  corpora    corpora      in WordNet 
word to class     Yes        Yes          No               No               No 
sense to class    Yes        No           Yes              Yes              No 
class to class    Yes        Yes          Yes              Yes              Yes 


Table 1: Comparison of selectional preference models. 


4. Estimation of Class Frequencies 


Frequencies for classes can be counted directly from the corpus when the class is linked to a word 
sense that actually appears in the corpus, written as fr(cn_i). Otherwise they have to be estimated using 
the direct counts for all subsumed concepts, written as \hat{fr}(cn). Formula (5) shows that all the counts 
for the subsumed concepts (cn_i) are added, but divided by the number of classes for which cn_i is a 
subclass (that is, all ancestors in the hierarchy). This is necessary to guarantee the following: 

\[ \sum_{cn \supseteq cn_i} P(cn_i \mid cn) = 1 \]

Formula (6) shows the estimated frequency of a concept given another concept. When the first concept 
is not subsumed by the second, it is 0; otherwise the frequency is estimated as in (5). 

Formula (7) estimates the counts for [nominal-concept relation verb] triples for all possible nominal 
concepts, based on the counts for the triples that actually occur in the corpus. All the counts 
for subsumed concepts are added, divided by the number of classes in order to guarantee the 
following: 

\[ \sum_{cn} P(cn \mid rel\ v) = 1 \]

Finally, formula (8) extends formula (7) to [nominal-concept relation verbal-concept] triples in a similar 
way. 
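
The estimation of class frequencies and the word-to-class probability can be illustrated with a small sketch. It is only an illustration under stated assumptions: the hierarchy and all counts are invented (they are not taken from WordNet or Semcor), a concept is assumed to subsume itself, and every function name is hypothetical. It follows formulas (3), (5), (6) and (7) for rel = object and v = eat.

```python
# Toy nominal hierarchy (hypothetical, not real WordNet): child -> parent.
PARENT = {
    "food": "entity", "person": "entity",
    "chicken_1": "food", "apple": "food", "chicken_3": "person",
}
CONCEPTS = ["entity", "food", "person", "chicken_1", "apple", "chicken_3"]

# Invented direct counts: fr(cn_i) over the whole corpus, and
# fr(cn_i rel v) for rel = object, v = eat.
FR = {"chicken_1": 3, "apple": 6, "chicken_3": 3}
FR_OBJ_EAT = {"food": 2, "apple": 3}


def subsumers(concept):
    """Concepts that subsume `concept`, itself included (an assumption)."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def classes(concept):
    """Number of classes of which `concept` is a subclass."""
    return len(subsumers(concept))


def estimate(counts, cn):
    """Formulas (5) and (7): add the direct counts of every concept
    subsumed by cn, each divided by its number of classes."""
    return sum(n / classes(ci) for ci, n in counts.items() if cn in subsumers(ci))


def est_fr_pair(ci, cn):
    """Formula (6): estimated frequency of ci given cn; 0 unless cn subsumes ci."""
    return estimate(FR, ci) if cn in subsumers(ci) else 0.0


def p_class_given_rel_v(cn):
    """P(cn | rel v): estimated triple count divided by fr(rel v)."""
    return estimate(FR_OBJ_EAT, cn) / sum(FR_OBJ_EAT.values())


def p_word_to_class(ci):
    """Formula (3): sum over every concept cn that subsumes ci."""
    total = 0.0
    for cn in subsumers(ci):
        denom = estimate(FR, cn)
        if denom > 0:  # skip classes with no estimated mass
            total += est_fr_pair(ci, cn) / denom * p_class_given_rel_v(cn)
    return total


if __name__ == "__main__":
    for sense in ("chicken_1", "chicken_3"):
        print(sense, round(p_word_to_class(sense), 4))
```

In this toy setting chicken_1 never occurs as object of eat, yet it receives probability 7/30 (about 0.233) through <food> and <entity>, while chicken_3 only receives 0.1 through <entity>.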

5. Algorithm for Integration in WordNet 

In principle, we can take all nominal-concept and verbal-concept pairs that have a probability higher 
than 0 in the class-to-class model, and add a relation for each pair into WordNet. This is the model we 
have used for the WSD task (cf. section 7), but it adds too many relations. We have devised a pruning 
algorithm that chooses the highest probability nodes for each combination of subtrees, discarding the rest 
(see Figure 1). The pruning algorithm does not affect the WSD results as it eliminates pairs that would 
never be selected. 

We will explain the pruning step for the nominal hierarchy first. We sort all the nominal classes in the 
selectional preference according to their probabilities. Starting with the concept with highest 
probability we prune all the nominal classes whose ancestors or descendants have appeared previously 
in the list. This way we only take the most informative nominal class from each branch of the 
hierarchy. We can see an example in Figure 1. There we can observe that the concepts <b>, <c> and 
<e> are pruned because of the appearance of <a> higher on the table. Only <a> and <d> are left. 

For a given relation rel 
    For each verb class cv_j in WordNet 
        Compute P(cn_i | rel cv_j) for each nominal class cn_i in WordNet 
        Optional pruning 
    End For 
    Link all (cn, cv) pairs with rel if P(cn | rel cv) > 0 


Nominal class    Probability 
<a>              0.8 
<b>              0.5    (pruned) 
<c>              0.4    (pruned) 
<d>              0.3 
<e>              0.1    (pruned) 

Figure 1: Algorithm for pruning and example of pruning for nominal classes. 

This algorithm can be extended easily to pairs of nominal and verbal concepts: a pair <a,b> is pruned 
when there is a pair <c,d> with higher probability such that <a,b> either subsumes both the nominal 
concept <c> and the verbal concept <d>, or is subsumed by both of them. 
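
The pruning step for the nominal hierarchy can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hierarchy is a hypothetical tree chosen so that <a> is an ancestor of <b>, <c> and <e>, as in Figure 1, and the sketch assumes that a class is compared only against the classes already kept (the text could also be read as comparing against every higher-ranked class).

```python
def prune(probabilities, parent):
    """Keep, from each branch, only the class with the highest probability.

    probabilities: dict mapping class name -> probability
    parent: dict mapping class name -> parent class name (a tree)
    """
    def ancestors(concept):
        found = set()
        while concept in parent:
            concept = parent[concept]
            found.add(concept)
        return found

    kept = []
    # Visit the classes from highest to lowest probability.
    for concept in sorted(probabilities, key=probabilities.get, reverse=True):
        related = any(
            k in ancestors(concept) or concept in ancestors(k) for k in kept
        )
        if not related:
            kept.append(concept)
    return kept


# Hypothetical hierarchy matching the example of Figure 1.
FIGURE1_PARENT = {"a": "root", "b": "a", "c": "b", "e": "a", "d": "root"}
FIGURE1_PROBS = {"a": 0.8, "b": 0.5, "c": 0.4, "d": 0.3, "e": 0.1}

if __name__ == "__main__":
    print(prune(FIGURE1_PROBS, FIGURE1_PARENT))
```

With these inputs only <a> and <d> survive, which matches the outcome described for Figure 1.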

6. Examples 

We will analyze the selectional restrictions acquired for the verb know on the object relation. We 
chose this verb because it has enough occurrences in Semcor (514) and the number of senses is not 
too high (11). We found 87 occurrences of know with an object. 

In Table 2 we show the nominal classes associated as object to the verb know using the word-to-class 
model. We only include the classes that remain after pruning, as defined in the previous section. For 
each class its probability is given. We see that selectional preference in the word-to-class model is 
confusing, and information coming from different senses (see table 3 for the list of word senses) is 
mixed up in the resulting table. Besides, the probability values for the different nominal classes are 
low. 









In Table 3 we can see the results using the sense-to-class and class-to-class approaches after pruning. 
Because of space constraints, only the classes with highest probability for each word sense are shown. 
For each sense, its description in WordNet and the number of examples for which an object relation 
has been found in Semcor is given. 

Most of the examples of know for which there is an object relation correspond to senses 1 and 2. 
Focusing on the sense-to-class model, we see that for sense 1 the highest probability is given to 
<communication>, which seems a good choice. Sense 2 admits a wide variety of objects, and 
therefore concepts that are high in the hierarchy are preferred as selectional restrictions, such as <entity, 
something> or <abstraction>. The other senses have fewer occurrences, but the system is still able 
to detect some interesting restrictions, such as <person, individual...> for the 4th sense (which gets a very 
high probability of 0.33) or <idea, thought> for the 7th sense. 

Class-to-class selectional restrictions tend to use more top-level concepts and are able to provide 
useful information for word senses that do not appear in the corpus (e.g.: the <make love,...> sense of 
know). There are some differences with the preferences obtained using the sense-to-class approach, 
but as we will show in the next section, both representations are valid and of similar quality. 

We observed that in some cases the ancestors of know could be very different and introduce noise, for 
example the ancestor <connect, link, tie> of the <make love,...> sense of know, but the algorithm is 
able to assign lower probability to those cases. To illustrate this example, in Figure 2 we show the 
ancestor hierarchy in WordNet for this sense of know, and Table 4 lists the selectional preferences for 
them. We can see that as the meaning of the verb ancestors gets more general, the noun concepts have 
lower probability. In this case, the <make love,...> concept assigns high probability to the correct 
restriction <person, individual...>, and the <connect, link, tie> concept assigns lower probabilities to 
its objects because the verbs belonging to this class get very different selectional preferences. In this 
example, the restriction <egg> gets some credit because of the incorrect identification by Minipar of 
the relation mate-object-egg from the sentence “The first worker bees do not mate or lay eggs,...”. 


Nominal class                                              Probability 
<person individual someone somebody mortal human soul>     0.0778 
<abstraction>                                              0.0656 
<object physical_object>                                   0.0445 
<cognition knowledge>                                      0.0380 
<group grouping>                                           0.0330 
<state>                                                    0.0211 
<body_part>                                                0.0202 
<act human_action human_activity>                          0.0190 
<spirit>                                                   0.0116 
<feeling>                                                  0.0073 


Table 2: Word-to-class selectional preferences for the objects of know. 





Sense of know (WordNet gloss) 

sense 1: know, cognize -- (be cognizant or aware of a fact or a specific piece of information; 
         possess knowledge or information about; 
sense 2: know -- (know how to do or perform something; "She knows how to knit"; "Does your husband 
         know how to cook?") 
sense 3: know -- (be aware of the truth of something; have a belief or faith in something; regard as true 
         beyond any doubt; "I know that I left the key on the table"; "Galileo knew that the earth moves 
         around the sun") 
sense 4: know -- (be familiar or acquainted with a person or an object; "She doesn't know this 
         composer"; "Do you know my sister?" "We know this movie") 
sense 5: know, experience, live -- (have firsthand knowledge of states, situations, emotions, or 
         sensations; "I know the feeling!") 
sense 6: acknowledge, recognize, know -- (discern; "His greed knew no limits") 
sense 8: love, make out, make love, sleep with, get laid, have sex, know, do it,... 

Sense-to-class: nominal classes (probability where legible) 

          <communication> 
          <measure quantity amount quantum> 
          <attribute> 
          <object physical_object> 
          <cognition knowledge> 

          <entity something> 
          <cognition knowledge> 
          <abstraction> 
          <condition status> 
          <act human_action human_activity> 
          <wealth riches> 

          <concept conception construct> 
          <time_period period period_of_time> 

0.3338    <person individual someone somebody mortal human soul> 
0.1417    <group grouping> 
0.0286    <self ego> 
0.0198    <abundance copiousness> 

0.2883    <idea thought> 

Class-to-class: nominal classes (probability where legible) 

          <entity something> 
          <cognition knowledge> 
          <abstraction> 
          <condition status> 
          <act human_action human_activity> 
          <wealth riches> 

          <abstraction> 
          <concept conception construct> 
          <person individual someone somebody mortal human soul> 
          <doubt uncertainty incertitude dubiety doubtfulness dubiousness> 

          <person individual someone somebody mortal human soul> 
          <group grouping> 
          <self ego> 
          <abundance copiousness> 

          <person individual someone somebody mortal human soul> 
          <physical_phenomenon> 

0.2883    <idea thought> 

Table 3: Sense-to-class and class-to-class selectional preferences for the objects of know. 


Verb concept 

    <love, make out, make love, sleep with, get laid, have sex, ...> 
    <copulate, mate, pair, couple> -- make love 
    <join, conjoin> -- make contact or come together; "The two roads join here" 
    <connect, link, tie> -- connect, fasten, or put together two or more pieces; 
        "Can you connect the two loudspeakers?" 

Probability 

    0.5521  0.1742  0.1018  0.0237  0.0152  0.0054  0.0048 
    0.0031  0.0028  0.0031  0.0019  0.0010  0.0007  0.0006 

Nominal class 

    <person individual someone somebody mortal human soul> 
    <person individual someone somebody mortal human soul> 
    <person individual someone somebody mortal human soul> 
    <object physical_object> 
    art> 
    <dirt filth grime soil stain 
    <communication> 
    <object physical_object> 
    <person individual someone somebody mortal human soul> 
    <communication> 
    <happening occurrence natural_event> 
    <attribute> 


Table 4: Selectional preferences associated to the ancestor hierarchy for the <make love,.. .> sense of know. 

<love, make out, make love, sleep with, get laid, have sex, ...> 
  => <copulate, mate, pair, couple> -- (make love; "Birds mate in the Spring") 
    => <join, conjoin> -- (make contact or come together; "The two roads join here") 
      => <connect, link, tie> -- (connect, fasten, or put together two or more pieces; "Can you connect the two loudspeakers?") 


Figure 2: Ancestor hierarchy for the <make love,...> sense of know. 


7. Training and Testing on a WSD Exercise 


The acquired preferences were tested on a WSD exercise. The goal is to choose the correct word 
sense for all nouns occurring as subjects and objects of verbs, but the method could also be used to 
disambiguate the verbs. The method selects the word sense of the noun that is below the strongest 
nominal class for the verb or verb class. If more than one word sense is below the strongest class, all are 
selected with equal weight. 
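
The selection rule can be sketched in a few lines. This is a hedged illustration rather than the system used in the experiments: the hierarchy, the sense names and the probabilities are invented, and the sketch assumes that when no sense lies below the strongest class the next strongest class is tried, returning no answer if none applies.

```python
def choose_senses(noun_senses, preferences, parent):
    """Return {sense: weight} for the senses below the strongest class.

    noun_senses: candidate senses (concept names) of the noun
    preferences: dict mapping nominal class -> probability for a verb (class)
    parent: dict mapping concept -> parent concept (hypothetical hierarchy)
    """
    def is_below(sense, cls):
        concept = sense
        while True:
            if concept == cls:
                return True
            if concept not in parent:
                return False
            concept = parent[concept]

    # Try the nominal classes from strongest to weakest preference.
    for cls in sorted(preferences, key=preferences.get, reverse=True):
        below = [s for s in noun_senses if is_below(s, cls)]
        if below:
            # Ties are resolved by sharing the weight equally.
            return {s: 1.0 / len(below) for s in below}
    return {}  # no sense is covered: the method gives no answer


# Invented toy data: two senses of "chicken" and preferences for eat-object.
TOY_PARENT = {
    "chicken_1": "food", "chicken_3": "person",
    "food": "entity", "person": "entity",
}
TOY_SENSES = ["chicken_1", "chicken_3"]

if __name__ == "__main__":
    print(choose_senses(TOY_SENSES, {"food": 0.6, "person": 0.1}, TOY_PARENT))
    print(choose_senses(TOY_SENSES, {"entity": 0.5}, TOY_PARENT))
```

Returning an empty answer when no class covers any sense is what makes coverage drop below 1 in a setting like this one.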

A more detailed account of the experiments can be found in (Agirre and Martinez 2000; 2001). Two 
experiments were performed. In the lexical sample experiment, we selected a set of 8 nouns at random and 
applied 10-fold cross-validation to make use of all available examples. In the all-nouns experiment, 
we selected four files previously used in other works and tested them in turn, using the rest of the files 
as the training set. 

Table 5 shows the data for the set of nouns of the lexical sample. Note that only 19% (15%) of the 
occurrences of the nouns are objects (subjects) of any verb. Table 6 shows the average results using 
subject and object relations for the different models. For the lexical sample task, each column shows 
respectively, the precision, the coverage over the occurrences with the given relation, and the recall. 
Random and most frequent baselines are also shown. Word-to-class gets slightly better precision than 
class-to-class, but class-to-class is near complete coverage and thus gets the best recall. All are well 
above the random baseline, but slightly below the most frequent sense (MFS). 
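
The three lexical-sample figures are related: if recall is measured over all occurrences with the given relation, it equals precision times coverage. That relation is an assumption about how the figures were computed, but the values reported in Table 6 agree with it to three decimals, as the short check below illustrates (the dictionary simply copies the lexical-sample columns of Table 6; the function names are hypothetical).

```python
# (precision, coverage, recall) copied from Table 6, lexical sample.
TABLE6 = {
    ("Random", "obj"): (0.192, 1.00, 0.192),
    ("Random", "subj"): (0.192, 1.00, 0.192),
    ("MFS", "obj"): (0.690, 1.00, 0.690),
    ("MFS", "subj"): (0.690, 1.00, 0.690),
    ("Word2class", "obj"): (0.669, 0.867, 0.580),
    ("Word2class", "subj"): (0.698, 0.794, 0.554),
    ("Class2class", "obj"): (0.666, 0.973, 0.648),
    ("Class2class", "subj"): (0.680, 0.986, 0.670),
}


def recall_from(precision, coverage):
    """Recall over all occurrences, assuming recall = precision x coverage."""
    return precision * coverage


def max_deviation(rows):
    """Largest gap between the reported recall and precision x coverage."""
    return max(abs(recall_from(p, c) - r) for p, c, r in rows.values())


if __name__ == "__main__":
    for (model, rel), (p, c, r) in TABLE6.items():
        print(model, rel, round(recall_from(p, c), 3), r)
```

This also makes the trade-off in the text concrete: class-to-class loses a little precision but gains enough coverage to reach the higher recall.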

In the all-nouns experiment, we disambiguated the nouns appearing in four files extracted from 
Semcor. We observed that not many nouns were related to a verb as object or subject (e.g. in the file 
br-a01 only 40% (16%) of the polysemous nouns were tagged as object (subject)). Table 6 shows the 
average recall obtained on this task. Class-to-class attains the best recall. 

We think that given the small corpus available, the results are good. Note that there is no smoothing 
or cut-off value involved, and some decisions are taken with very few data points. Both smoothing 
and cut-off values would surely allow precision to be improved. On the other hand, the literature has 
shown that the most frequent sense baseline needs less training data. 


Noun        # sens    # occ    # occ. as obj    # occ. as subj 
account       10        27           8                3 
age            5       104          10                9 
church         3       128          19               10 
duty           3        25           8                1 
head          30       179          58               16 
interest       7       140          31               13 
member         5        74          13               11 
people         4       282          41               83 
Overall       67       959         188              146 


Table 5: Data for the selected nouns. 



               Lexical sample (8 nouns)                       All-nouns (4 files) 
               Obj.                   Subj.                   Obj.      Subj. 
               Prec.   Cov.   Rec.    Prec.   Cov.   Rec.     Rec.      Rec. 
Random         .192    1.00   .192    .192    1.00   .192     .265      .296 
MFS            .690    1.00   .690    .690    1.00   .690     .698      .790 
Word2class     .669    .867   .580    .698    .794   .554     .424      .597 
Class2class    .666    .973   .648    .680    .986   .670     .506      .692 


Table 6: Average results for the 8 nouns and the 4 files. 





8. Conclusions 


We presented a statistical model that extends selectional preference to classes of verbs, yielding a 
relation between classes in a hierarchy, as opposed to a relation between a word and a class. The 
motivation to depart from word-to-class models is twofold: different senses of a verb may have 
different preferences, and some classes of verbs can share preferences. Besides, the model can be 
trained on untagged corpora, and has the following advantages over word-to-class models: it can 
provide selectional preferences for word senses of verbs that do not occur in the corpus via 
inheritance, and it provides a model that can be integrated easily as a relation between concepts in 
WordNet. In this sense, an algorithm to integrate the acquired class-to-class selectional restrictions in 
WordNet in a sensible way has also been described. 

The model is trained using subject-verb and object-verb relations extracted from a sense- 
disambiguated corpus by Minipar. A peculiarity of this exercise is the use of a sense-disambiguated 
corpus, in contrast to using a large corpus of ambiguous words. Evaluation is based on a word sense 
disambiguation exercise for a sample of words and a sample of documents from Semcor. The 
proposed model gets similar results on precision but significantly better recall than the classical word- 
to-class model. This can be explained by the fact that the class-to-class model generalizes well and is 
able to provide sensible selectional preferences to verb senses not seen in the training data. 

For the future, we plan to train the model on a large untagged corpus, in order to compare the quality 
of the acquired selectional preferences with those extracted from this small tagged corpus. The model 
can easily be extended to disambiguate other relations and PoS, and we plan to measure the 
effectiveness for other PoS. 

References 

Abe, N. & Li, H. (1996) Learning Word Association Norms Using Tree Cut Pair Models. In 
Proceedings of the 13th International Conference on Machine Learning (ICML). 

Agirre, E. and Martinez, D. (2000) Decision lists and automatic word sense disambiguation. COLING 
2000, Workshop on Semantic Annotation and Intelligent Content. Luxembourg. 

Agirre, E. and Martinez, D. (2001) Learning class-to-class selectional preferences. Proceedings of the 
Workshop “Computational Natural Language Learning” (CoNLL-2001). In conjunction with 
ACL'2001/EACL'2001. Toulouse, France. 6-7th July 2001. 

Lin, D. (1993) Principle Based parsing without Overgeneration. In 31st Annual Meeting of the 
Association for Computational Linguistics. Columbus, Ohio, pp 112-120. 

Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. (1990) Five Papers on WordNet. 
Special Issue of the International Journal of Lexicography, 3(4). 

Miller, G. A., C. Leacock, R. Tengi, and R. T. Bunker. (1993) A Semantic Concordance. Proceedings 
of the ARPA Workshop on Human Language Technology. 

Resnik, P. (1992) A class-based approach to lexical discovery. In Proceedings of 
the 30th Annual Meeting of the Association for Computational Linguistics, 327-329. 

Resnik, P. (1997) Selectional Preference and Sense Disambiguation. In Proceedings of the ANLP 
Workshop “Tagging Text with Lexical Semantics: Why, What and How?”, Washington, DC. 

Stetina, J., Kurohashi, S., Nagao, M. (1998) General Word Sense Disambiguation Method Based on a 
Full Sentential Context. In Usage of WordNet in Natural Language Processing, Proceedings of 
COLING-ACL Workshop. Montreal (Canada). 

Eneko Agirre, IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain. 
eneko@si.ehu.es 

David Martinez, IXA NLP Group, University of the Basque Country, 649 pk. 20.080, Donostia. Spain. 
iibmaird@si.ehu.es 





Words with Attitude 


Jaap Kamps 
Maarten Marx 


Abstract 

The traditional notion of word meaning used in natural language processing is literal or lexical 
meaning as used in dictionaries and lexicons. This relatively objective notion of lexical 
meaning is different from more subjective notions of emotive or affective meaning. Our aim 
is to come to grips with subjective aspects of meaning expressed in written texts, such as 
the attitude or value expressed in them. This paper explores how the structure of the WordNet 
lexical database might be used to assess affective or emotive meaning. In particular, we 
construct measures based on Osgood’s semantic differential technique. 


1 Introduction 

The traditional notion of word meaning used in natural language processing is literal or lexical meaning. 
This is the way the meaning of words is explained in dictionaries and lexicons. And, as may come as 
no surprise, the majority of research in natural language processing deemphasizes other aspects of 
meaning. Yet at the same time, we find a myriad of notions of meaning in the writings of philosophers, 
linguists, psychologists, and sociologists. This is not the place to have an extensive discussion on 
the meaning of meaning, but our aim will be to bring other notions of meaning into natural language 
processing. In particular, we will be interested in the differences between the relatively objective notion 
of lexical meaning, and more subjective notions of emotive or affective meaning. 

Suppose we can evaluate the subjective meaning expressed in a text. This would allow us to classify 
documents on subjective criteria, rather than on their factual content. This can be as radical a change as 
categorizing the screws in a repair shop’s inventory by their beauty, instead of their size and material. 
This may not be very practical for a repair shop, but document classification does not require a physical 
rearrangement of objects. As a result, it would simply provide an additional classification criterion. It 
is not difficult to envision cases in which precisely a subjective categorization is desirable and useful. 

Our aim is to come to grips with aspects of the subjective meaning expressed in written texts, such 
as the attitude or value expressed in them. Of course, there are well-established approaches for this in 
the social and behavioral sciences. In particular, methods like surveys or test panels in which persons 
evaluate certain subjective criteria. However, the advent of the Internet gives us access to large numbers 
of documents and large corpora. Here, applying these traditional methods of evaluation is impractical: 
it is simply too time-consuming and very costly. For these reasons, we are interested in measures that 
can be evaluated automatically. 

Our working hypothesis is that subjective aspects of meaning can be derived from the particular 
choice of words in a text. That is, there are indeed words with attitude or values. Prominent candidates 
for this are modifiers, such as descriptive adjectives like ‘beautiful’ or ‘good’ (and their antonyms ‘ugly’ 
and ‘bad’). This paper explores how to assess more subjective aspects of meaning by using the structure 
of the WordNet lexical database (Miller, 1990; Fellbaum, 1998). At first glance, this may appear to be 
a bad choice because the words in WordNet are structured by their lexical meaning. In particular, the 
synonymy or SYNSET relation in WordNet represents the coincidence of lexical meaning. However, 
the organization of WordNet is not a conventional alphabetical list, but a large interconnected network 
of words (resembling the organization of human lexical memory). One of the design principles in 
WordNet is a differential theory of meaning: the meaning of a concept is determined by its place 
relative to other concepts. It is precisely this larger WordNet structure that we want to exploit. 

This paper is structured as follows. In §2, we will discuss a classical theory for measuring affective 
or emotive meaning. From this we take the major factors that differentiate between values or attitude. 
Then, in §3 we explore how we can translate the structure of WordNet into a measure for these factors. 
Next, in §4 we discuss how such measures can be implemented, and we end in §5 with conclusions and 
some discussion. 

2 Affective Aspects of Meaning 

Our aim is to measure the subjective meaning expressed in a text. For such an enterprise to be successful, 
there must be sufficient generality in the semantic dimensions used by individuals. This immediately 
prompts a number of questions: do such generic semantic dimensions exist at all? And if so, can 
we characterize these specific semantic dimensions? 

The classic work on measuring emotive or affective meaning in texts is Charles Osgood’s Theory 
of Semantic Differentiation (Osgood et al., 1957). Osgood and his collaborators identify the aspect of 
meaning in which they are interested as 

a strictly psychological one: those cognitive states of human language users which are necessary antecedent 
conditions for selective encoding of lexical signs and necessary subsequent conditions in selective decoding of 
signs in messages. (Osgood et al., 1957, p.318) 

Their semantic differential technique uses several pairs of bipolar adjectives to scale the responses of 
subjects to words, short phrases, or texts. That is, subjects are asked to rate their meaning on scales like 
active-passive; good-bad; optimistic-pessimistic; positive-negative; strong-weak; serious-humorous; 
and ugly-beautiful. 

Each pair of bipolar adjectives is a factor in the semantic differential technique. As a result, the 
differential technique can cope with quite a large number of aspects of affective meaning. A natural 
question to ask is whether each of these factors is equally important. Osgood et al. (1957) use factorial 
analysis of extensive empirical tests to investigate this question. The surprising answer is that most 
of the variance in judgment could be explained by only three major factors. These three factors of 
the affective or emotive meaning are the evaluative factor (e.g., good-bad); the potency factor (e.g., 
strong-weak); and the activity factor (e.g., active-passive). Among these three factors, the evaluative 
factor has the strongest relative weight. In the next section, we will focus on this most important factor 
of affective meaning. 

3 Affective Meaning and WordNet 

We will now investigate measures for the evaluative factor of meaning based on the WordNet lexical 
database (Fellbaum, 1998). The WordNet database has entries on the level of words (just as traditional 
dictionaries and lexicons). The unit of evaluation we are interested in is not individual words, but larger 
units of text, such as phrases, paragraphs, and larger units. We will proceed as follows: we will first 
investigate WordNet-based measures for individual words, and then consider ways of aggregating the 
scores of individual words to larger textual units. For example, an obvious way is to view a textual unit 
as a bag of words, and evaluate the text by combining the scores for the individual words in the text. 

The evaluative dimension of Osgood is typically determined using the adjectives ‘good’ and ‘bad’ 
(other operationalizations are possible depending on the subject under investigation). Indeed, if we 
look up the meaning of these two evaluative adjectives in WordNet we find that they are each other’s 
antonym. Our plan is to evaluate individual words by determining their relation to the words ‘good’ and 
‘bad’ in the WordNet database. For this, we can use the synonymy relation (or a generalization of it) to 
establish the relatedness of two words. That is, WordNet’s SYNSET relation may provide a handle to 
determine Osgood’s evaluative factor. 





We will define the notion of n-relatedness based on the SYNSET relation (this is similar to the 
graph-theoretic notion of connectedness). 

Definition 1 Two words $w_0$ and $w_n$ are n-related if there exists an $(n+1)$-long sequence of words 
$\langle w_0, w_1, \ldots, w_n \rangle$ such that for each $i$ from $0$ to $n-1$ the two words $w_i$ and $w_{i+1}$ are in the same 
SYNSET (i.e., $w_i$ and $w_{i+1}$ are synonymous). 

For example, the adjectives ‘good’ and ‘proper’ are 2-related since there exists a 3-long sequence 
(good, right, proper). Two words may of course be related by many different sequences, or by none 
at all. We will mainly be interested in the shortest sequences relating words. The minimal path-length 
(MPL) of two words $w_i$ and $w_j$ is $n$ if there is an $(n+1)$-long sequence relating $w_i$ and $w_j$ and there 
is no shorter sequence relating them. If there is no sequence relating the two words, then the minimal 
path-length is undefined. 

Definition 2 Let MPL be a partial function such that $\mathrm{MPL}(w_i, w_j) = n$ if $n$ is the smallest number 
such that $w_i$ and $w_j$ are n-related. 

The minimal path-length enjoys some of the geometrical properties we might expect from a distance 
measure. 

Observation 1 The minimal path-length is a metric, that is, it gives a non-negative number 
$\mathrm{MPL}(w_i, w_j)$ such that 

i) $\mathrm{MPL}(w_i, w_j) = 0$ if and only if $w_i = w_j$, 

ii) $\mathrm{MPL}(w_i, w_j) = \mathrm{MPL}(w_j, w_i)$, and 

iii) $\mathrm{MPL}(w_i, w_j) + \mathrm{MPL}(w_j, w_k) \geq \mathrm{MPL}(w_i, w_k)$. 

The minimal path-length is a straightforward generalization of the synonymy relation. The synonymy 
relation connects words with similar meaning, so the minimal distance between words says something 
on the similarity of their meaning. For example, using WordNet we now find that 

• MPL(good, proper) = 2, 

• MPL(good, neat) = 3, and 

• MPL(good, noble) = 4. 
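
The minimal path-length can be computed with a breadth-first search over the synonymy links. The following is a minimal sketch rather than the authors' implementation: the toy synsets contain only the chain good, right, proper mentioned above plus one invented, disconnected pair (‘purple’, ‘violet’), and the names `TOY_SYNSETS`, `synonym_graph`, and `mpl` are assumptions made for illustration.

```python
from collections import deque

# Toy data, NOT real WordNet: the first two synsets encode the chain
# (good, right, proper) from the text; the last pair is invented so that
# the "undefined" case can be shown.
TOY_SYNSETS = [
    {"good", "right"},
    {"right", "proper"},
    {"purple", "violet"},
]


def synonym_graph(synsets):
    """Map each word to the set of words that share a synset with it."""
    graph = {}
    for synset in synsets:
        for word in synset:
            graph.setdefault(word, set()).update(synset - {word})
    return graph


def mpl(graph, source, target):
    """Minimal path-length between two words; None if they are unrelated."""
    if source not in graph or target not in graph:
        return None
    distances = {source: 0}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            return distances[word]
        for neighbour in graph[word]:
            if neighbour not in distances:  # first visit is the shortest path
                distances[neighbour] = distances[word] + 1
                queue.append(neighbour)
    return None


GRAPH = synonym_graph(TOY_SYNSETS)
print(mpl(GRAPH, "good", "proper"))  # 2
print(mpl(GRAPH, "good", "violet"))  # None (MPL is a partial function)
```

Returning `None` for unrelated words mirrors the fact that MPL is only a partial function.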

This suggests that we can use the MPL distance measure for determining Osgood’s evaluative dimension, 
for example by scoring words that are closely related to the words ‘good’ and ‘bad’ respectively. 
That is, we might consider using the distance to the word ‘good’ as a measure of ‘goodness.’ This 
makes sense considering that the SYNSET relation in WordNet represents similarity of meaning, and 
our MPL is a straightforward generalization of the SYNSET relation. 

Figure 1 shows the minimal path-lengths of a selection of adjectives to the adjective ‘good’ based 
on the WordNet database.¹ Inspection of such a cloud of words gives us some confidence in the use 
of MPL as a measure for similarity of meaning. Note that we do not claim that the values obtained in 
this way are a precise scale for measuring degrees of goodness. Rather, we only expect a weak relation 
between the words used to express a positive opinion and their distance to words like ‘good.’ 

However, further experimentation quickly reveals that this relation is very weak indeed. It turns out 
that the similarity of meaning waters down remarkably quickly. A striking example of this is that we also 
find that ‘good’ and ‘bad’ themselves are closely related in WordNet. 

Observation 2 There exists a 5-long sequence (good, sound, heavy, big, bad). So, we have that 
MPL(good, bad) = 4. 

1 To be more precise: these are all adjectives w with MPL(good, w) < 2 and word familiarity or polysemy count > 2. 






Figure 2: The MPL’s to adjectives ‘good’ and ‘bad’. Nodes are connected by edges of length corresponding 
to the MPL. 



Figure 3: The values assigned by the EVA function. 






Even though the adjectives ‘good’ and ‘bad’ have opposite meanings (they are antonyms), they are still 
closely related by the synonymy relation.² As a result of this, we must seriously question whether the 
relatedness to the word ‘good’ is a measure of ‘goodness,’ since any word related to ‘good’ is at most 
slightly less closely related to ‘bad.’ 

Observation 3 For any $w$, if $\mathrm{MPL}(\text{good}, w) = n$ then $n - 4 \leq \mathrm{MPL}(\text{bad}, w) \leq n + 4$. 
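
The bound can be checked directly from the metric properties in observation 1 together with MPL(good, bad) = 4 from observation 2. The short derivation below is a sketch of that reasoning, assuming $w$ is related to ‘good’; it applies symmetry and the triangle inequality twice.

```latex
\begin{align*}
\mathrm{MPL}(\text{bad}, w)
  &\leq \mathrm{MPL}(\text{bad}, \text{good}) + \mathrm{MPL}(\text{good}, w) = 4 + n, \\
n = \mathrm{MPL}(\text{good}, w)
  &\leq \mathrm{MPL}(\text{good}, \text{bad}) + \mathrm{MPL}(\text{bad}, w) = 4 + \mathrm{MPL}(\text{bad}, w).
\end{align*}
```

The first line gives the upper bound $n + 4$, and rearranging the second gives the lower bound $n - 4$. Both bounds are attained: for $w$ = ‘good’ we have $n = 0$ and MPL(bad, good) = 4.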

We seem to be at a dead-end: the WordNet database gives us similarity of meaning by its SYNSET 
relation, but its straightforward generalization MPL fails to provide a general measure of coincidence 
of meaning. 

At this point several strategies present themselves. We might argue that, despite observation 3, we 
may still expect some correlation between the opinion expressed in a text and (a refined version of) a 
distance measure like MPL. Here, we will pursue an alternative strategy based on the fact that any word 
that is related to the adjective ‘good’ is also related to the adjective ‘bad’ (and vice versa). That is, we 
will use observation 3 to our advantage. 

For each word, we can consider not only the shortest distance to ‘good’ but also the shortest distance 
to the antonym ‘bad.’ Figure 2 shows the minimal path-lengths of words to both the adjectives ‘good’ 
and ‘bad.’³ Inspection reveals that words neatly cluster in groups depending on their minimal path-lengths 
to ‘good’ and ‘bad.’ In short, this sort of graph seems to resonate closely with an underlying 
evaluative factor (at least, much better than graphs based on a single distance measure such as figure 1). 

We try to materialize this impression by defining a three argument function TRI that measures the 
relative distance of a word to two reference words. 


Definition 3 We define a partial function TRI of $w_i$, $w_j$, and $w_k$ (with $w_j \neq w_k$) as 

$$\mathrm{TRI}(w_i; w_j, w_k) = \frac{\mathrm{MPL}(w_i, w_k) - \mathrm{MPL}(w_i, w_j)}{\mathrm{MPL}(w_k, w_j)}$$

If any of $\mathrm{MPL}(w_i, w_j)$, $\mathrm{MPL}(w_i, w_k)$, or $\mathrm{MPL}(w_k, w_j)$ is undefined, then $\mathrm{TRI}(w_i; w_j, w_k)$ is undefined. 

We calculate the function TRI based on two reference words ($w_j$ and $w_k$ in definition 3). The maximal 
difference in minimal path-length to the two reference words depends on the MPL of the two reference 
words (by observation 3). Therefore, we divide the difference by the MPL of the two reference 
words, yielding a value in the interval [-1,1]. In particular, we will be interested in the partial function 
TRI instantiated for the reference words ‘good’ and ‘bad.’ Recall that these two words correspond to 
Osgood’s evaluative factor. 

Definition 4 We define a partial function EVA of $w$ as $\mathrm{EVA}(w) = \mathrm{TRI}(w; \text{good}, \text{bad})$. 


We now have that every word, provided it is related to the adjectives ‘good’ and ‘bad,’ will be assigned 
a value ranging from -1 (for words on the ‘bad’ side of the lexicon) to 1 (for words on the ‘good’ side 
of the lexicon). Figure 3 shows how the EVA function assigns values based on the minimal-path lengths 
from adjectives ‘good’ and ‘bad.’ 

For example, using WordNet we now find the following measures: 


• EVA(proper) = TRI(proper; good, bad) = $\frac{6-2}{4}$ = 1, 

• EVA(neat) = $\frac{3-3}{4}$ = 0, 

• EVA(noble) = $\frac{5-4}{4}$ = 0.25, 

• EVA(good) = $\frac{4-0}{4}$ = 1, and 

• EVA(bad) = $\frac{0-4}{4}$ = -1. 

2 Although this is perhaps remarkable, it is not due to some error in the WordNet database (there exist several paths of 
length 5). Part of the explanation seems to be the wide applicability of these two adjectives (WordNet has 14 senses of bad 
and 25 senses of good). Think of the small world problem predicting mean distance of 6 between arbitrary people (Milgram, 
1967). 

3 To be more precise: these are all adjectives w with MPL(good, w) < 3 or MPL(bad, w) < 3, and with word familiarity 
or polysemy count > 2. 


Note that we do not claim that the EVA function assigns a precise measure of the ‘goodness’ or ‘badness’ 
of individual words (if such a thing is possible at all). Rather, we can only expect that it allows us to 
differentiate between words that are predominantly used for expressing positive opinions (values close 
to 1), or for expressing negative opinions (values close to -1), or for neutral words (values around 0). 
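
To make the definitions of TRI and EVA concrete, here is a small sketch that evaluates them on a hand-written table of minimal path-lengths. The distances to ‘good’ and MPL(good, bad) = 4 are the values quoted in the text; the distances to ‘bad’ for ‘proper’, ‘neat’, and ‘noble’ are not stated explicitly and are inferred here from the reported EVA values. The function and table names are assumptions for illustration.

```python
# Distances quoted in the text (to 'good'), and distances to 'bad' that are
# INFERRED from the reported EVA values rather than read from WordNet.
MPL_TO_GOOD = {"good": 0, "bad": 4, "proper": 2, "neat": 3, "noble": 4}
MPL_TO_BAD = {"good": 4, "bad": 0, "proper": 6, "neat": 3, "noble": 5}


def tri(dist_to_j, dist_to_k, dist_jk):
    """TRI(w_i; w_j, w_k) from three path-lengths; None if any is undefined."""
    if dist_to_j is None or dist_to_k is None or not dist_jk:
        return None
    return (dist_to_k - dist_to_j) / dist_jk


def eva(word):
    """EVA(w) = TRI(w; good, bad); None for words unrelated to the pair."""
    return tri(MPL_TO_GOOD.get(word), MPL_TO_BAD.get(word), MPL_TO_GOOD["bad"])


for word in ("proper", "neat", "noble", "good", "bad", "wooden"):
    print(word, eva(word))
# proper 1.0, neat 0.0, noble 0.25, good 1.0, bad -1.0, wooden None
```

The word ‘wooden’ is a hypothetical example of a word absent from the table, for which the partial function is undefined.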

Recall that EVA is a partial function that is undefined for words that are unrelated to the adjectives 
‘good’ and ‘bad.’ The unrelatedness of such words is a sign that they are indifferent for assessing 
the evaluative factor. That is, unrelated words can be regarded as neutral for the EVA function. We 
will complete the partial function EVA in precisely this way, and define a complete function EVA* that 
returns a value for any arbitrary word.⁴ 

Definition 5 The function EVA* is defined as follows: 

$$\mathrm{EVA}^*(w) = \begin{cases} \mathrm{EVA}(w) & \text{if defined} \\ 0 & \text{if undefined} \end{cases}$$


Recall that we are mainly interested in evaluating larger textual units. A straightforward aggregation 
procedure is to view a text as a bag of words, evaluate each of these individual words, and simply add up 
their scores. Slightly abusing our notation, we will generalize the EVA* function to apply to arbitrary 
sequences of words. 

Definition 6 Let $\langle w_1, \ldots, w_n \rangle$ be a bag of words. We define the function EVA* as follows: 

$$\mathrm{EVA}^*(\langle w_1, \ldots, w_n \rangle) = \sum_{i=1}^{n} \mathrm{EVA}^*(w_i)$$

We now have a function EVA* that gives us a value for any arbitrary text. The precise interpretation 
of this value is not immediately clear, because it depends on how well our operationalization captures 
the concept of meaning we set out to measure (which was not very precisely defined to start with). 
Although the EVA* function yields a specific value, we will be happy to use it as a coarse-grained ordinal 
scale, for example by classifying a text as positive, neutral, or negative, depending on the sign of the 
EVA* function. 
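
The bag-of-words aggregation and the sign-based classification can be sketched as follows. This is only an illustration under stated assumptions: the score table holds just the five EVA values quoted earlier (a real list would cover every related adjective), the regular-expression tokenizer is a simplification, and no part-of-speech check is made, so a noun use of ‘good’ would be scored as if it were the adjective.

```python
import re

# Hypothetical word list holding only the EVA values quoted in the text.
EVA_SCORES = {"good": 1.0, "bad": -1.0, "proper": 1.0, "neat": 0.0, "noble": 0.25}


def eva_star_word(word):
    """EVA*(w): the EVA value if defined, otherwise 0."""
    return EVA_SCORES.get(word, 0.0)


def eva_star_text(text):
    """EVA* of a text viewed as a bag of words: the sum of the word scores."""
    words = re.findall(r"[a-z]+", text.lower())  # assumed, naive tokenizer
    return sum(eva_star_word(word) for word in words)


def classify(text):
    """Coarse ordinal reading of EVA*: only the sign of the score is used."""
    score = eva_star_text(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(eva_star_text("A good and noble plan, but a bad result."))  # 0.25
print(classify("A good and noble plan, but a bad result."))       # positive
print(classify("The result was bad."))                            # negative
print(classify("The table is wooden."))                           # neutral
```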


4 Implementation and Evaluation 

In the previous section, we have defined a function EVA* that gives a measure for the evaluative factor 
of meaning expressed in a text. To apply this measure in practice would require us to calculate a large 
number of minimal path-lengths between words (recall the definition of EVA* in terms of TRI and 
MPL). Calculating a large number of minimal path-lengths is far from trivial in a large network like the 
WordNet database. Especially since many words will not be related to the adjectives ‘good’ and ‘bad,’ 
which is the hardest case to establish. To make this problem feasible, we compile lists of words related 
to ‘good’ and ‘bad,’ either up to a particular MPL, or all related words. Words not occurring on this list 
have EVA* value zero, and can be safely ignored. 

4 To be more precise, following WordNet we use the SYNSET relation only for words with the same part-of-speech (nouns, 
adjectives, verbs, adverbs), and only consider adjectives for EVA and EVA*. That is, the EVA* of a verb or noun is zero. 

For this purpose, we have implemented a set of scripts that can efficiently generate related words 
by their MPL. The script starts with a particular word (such as ‘good’) and recursively generates all 
synonyms while filtering away words it has encountered earlier. That is, we start with a particular word 
$w$ (i.e., having minimal path-length zero to itself), then generate all words $w_i$ with $\mathrm{MPL}(w, w_i) = 1$, 
then with $\mathrm{MPL}(w, w_i) = 2$, etcetera, until the search is exhausted, or until we reach a given maximal value 
of MPL. The script has an additional argument that allows us to ignore words with a low polysemy 
count. By running this script on two related words (such as ‘good’ and ‘bad’), we will have determined 
the minimal path-lengths needed for calculating the weight of all related words. The resulting list of 
rated words can be stored in a file for future use. 
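
The level-by-level generation described above can be sketched in a few lines. The code below is not the authors' script: it runs on a toy adjacency table containing only the two chains named in the text, (good, right, proper) and (good, sound, heavy, big, bad), it omits the polysemy-count filter, and the names `related_words` and `weights` are assumptions.

```python
# Toy synonym graph, NOT real WordNet: only the two chains named in the text.
TOY_GRAPH = {
    "good": {"right", "sound"},
    "right": {"good", "proper"},
    "proper": {"right"},
    "sound": {"good", "heavy"},
    "heavy": {"sound", "big"},
    "big": {"heavy", "bad"},
    "bad": {"big"},
}


def related_words(graph, start, max_mpl=None):
    """Return {word: MPL(start, word)}, expanding one synonym step per level."""
    distances = {start: 0}
    frontier = [start]
    level = 0
    while frontier and (max_mpl is None or level < max_mpl):
        level += 1
        next_frontier = []
        for word in frontier:
            for synonym in graph.get(word, ()):
                if synonym not in distances:  # filter words seen earlier
                    distances[synonym] = level
                    next_frontier.append(synonym)
        frontier = next_frontier
    return distances


def weights(graph, positive, negative):
    """Weight every word related to both reference words, as in TRI."""
    to_pos = related_words(graph, positive)
    to_neg = related_words(graph, negative)
    span = to_pos.get(negative)  # MPL of the two reference words
    if not span:
        return {}
    return {w: (to_neg[w] - to_pos[w]) / span for w in to_pos if w in to_neg}


print(related_words(TOY_GRAPH, "good", max_mpl=2))
print(weights(TOY_GRAPH, "good", "bad"))
```

Running the search once from each reference word yields all distances needed, so every weight is then a constant-time lookup, which is the point of compiling the list in advance.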

In particular, we can run these scripts exhaustively on the adjectives ‘good’ and ‘bad.’ As it turns out, 
this generates a list of 5410 adjectives (or a component in graph-theoretical terms). 

Observation 4 The set of words which are n-related to the adjectives 'good' and 'bad' (for some n) 
consists of 5410 adjectives. 

The adjective cluster in which ‘good’ and ‘bad’ reside contains 25% of the adjectives in the WordNet 
database.⁵ For each of these words, we can assign a weight corresponding to the evaluative factor of 
meaning: the EVA function assigns a value in the interval [-1,1], with positive values for words on 
the ‘good’ side and negative values for words on the ‘bad’ side. Note that this exhaustive list will 
completely determine the EVA* function: all words not on this list will have EVA* value zero. This 
allows us to efficiently calculate the EVA* function of a text. 

The exhaustive list of adjectives related to ‘good’ and ‘bad’ is also useful in its own right. We can use 
such lists for further evaluation of the constructed measures. In particular, one may suspect there to be 
a bias towards one of the bipolar adjectives, simply by the number of words in the WordNet database. 
This is not unlikely considering that the WordNet database gives 35 synonyms of the adjective ‘good,’ 
and only 15 synonyms of ‘bad.’ Using the exhaustive list of all 5410 adjectives related to ‘good’ and 
‘bad,’ we can simply add up each word’s assigned value. Recall that these values range from -1 to 
1, so if the amounts of positive and negative words are completely balanced, the grand total will be 
zero, making the mean value assigned to a word zero as well. It turns out that the total score over 5410 
words is -48.25, yielding a mean value of $\frac{-48.25}{5410} = -0.0089$.⁶ This is only a marginal deviation, so 
we may conclude that the list of words is well-balanced between the two opposite words. In light of 
the resemblance of the WordNet database structure to human lexical memory, this finding increases our 
confidence that the EVA* measure corresponds to an evaluative aspect of meaning. This relates to 
one of the problems left unsolved in Osgood et al. (1957, p.327). 

One of the most difficult methodological problems we have faced—unsuccessful so far—is to demonstrate that 
the polar terms we now use are true psychological opposites, i.e., fall at equal distances from the origin of the 
semantic space and in opposite directions along a straight line passing through the origin. 

Almost half a century later, our measure based on the WordNet database provides some indirect evidence 
for this.⁷ In this sense, our work can also be viewed as a partial evaluation of Osgood’s original 
semantic differential technique. 
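
The balance figures reported for the word list are simple arithmetic and can be re-derived from the numbers given in the text and in footnotes 5 and 6. The snippet below is only such a check; the variable names are assumptions, and the average weight of -0.5 for a negative word is the estimate used in footnote 6.

```python
total_score = -48.25      # reported sum of EVA values over the list
n_words = 5410            # size of the 'good'/'bad' adjective cluster
n_adjectives = 21365      # adjectives in WordNet 1.7 (unique strings)

mean_value = total_score / n_words
excess_negative = total_score / -0.5      # footnote 6: average negative weight
share_of_list = excess_negative / n_words
cluster_share = n_words / n_adjectives

print(round(mean_value, 4))             # -0.0089
print(excess_negative)                  # 96.5
print(round(100 * share_of_list, 2))    # 1.78
print(round(100 * cluster_share, 2))    # 25.32
```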

The same set of scripts also allows us to compile lists for the other factors of meaning. For Osgood’s 
potency factor, the prototypical operationalization is using the adjectives ‘strong’ and ‘weak.’ As it 
turns out, ‘strong’ and ‘weak’ are antonyms in WordNet, but also related by the synonymy relation. 

Observation 5 MPL(strong, weak) = 6 

We can define a function POT* as follows: 

Definition 7 The function POT* of a word $w$ is defined as follows: 

$$\mathrm{POT}^*(w) = \begin{cases} \mathrm{TRI}(w; \text{strong}, \text{weak}) & \text{if defined} \\ 0 & \text{otherwise} \end{cases}$$

Let $\langle w_1, \ldots, w_n \rangle$ be a bag of words. We define the function POT* of a tuple $\langle w_1, \ldots, w_n \rangle$ as: 

$$\mathrm{POT}^*(\langle w_1, \ldots, w_n \rangle) = \sum_{i=1}^{n} \mathrm{POT}^*(w_i)$$

5 Our version of WordNet, 1.7, has 21365 adjectives (i.e., when counting unique strings), so the cluster surrounding ‘good’ 
and ‘bad’ is 25.32% of the total collection of adjectives. Generating the exhaustive list for adjectives ‘good’ and ‘bad’ takes 
6 minutes and 19 seconds on a Pentium-III 800 MHz with 512 MB memory running Red Hat Linux 7.0. 

6 Perhaps we can make this clearer by estimating the number of words with the ‘wrong’ sign. Since negative words 
range from -1 to 0, the average weight of a negative word is -0.5. So we may estimate the excess of negative words to be 
$\frac{-48.25}{-0.5} = 96.5$ words, which is 1.78% of the total number of words in the list. This amounts to flipping the sign of 48 words 
in the list. 

7 At least, this seems to be the case for the English language; it is unclear whether there are significant differences in other 
languages or cultures. This could be investigated using the multi-lingual versions of EuroWordNet (Vossen, 1998). 

The third major factor of meaning, Osgood’s activity factor, is usually operationalized using the two 
adjectives ‘active’ and ‘passive.’ Again, adjectives ‘active’ and ‘passive’ are antonyms in WordNet, but 
also related by the synonymy relation. 

Observation 6 MPL(active, passive) = 12 

We define a function ACT* just like EVA* and POT* but now with the reference words ‘active’ 
and ‘passive.’ Specifically, we will use the TRI(w; active, passive) function yielding a value 1 for 
ACT*(active), and -1 for ACT*(passive). 

Similar to the evaluative factor, our set of scripts generates lists of all related adjectives for the 
potency and activity factors. Investigating these three lists, we immediately discover the following 
remarkable finding. 

Observation 7 All three lists corresponding to EVA*, POT*, and ACT* functions single-out the same 
cluster of 5410 related adjectives in WordNet. 
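
One way to read this observation: since relatedness is symmetric and transitive, the set of words reachable from a reference word is its connected component, and any two words in the same component generate the same set. The observation then amounts to all six reference adjectives lying in one component. The sketch below illustrates this on an invented toy graph; the links are hypothetical and not taken from WordNet.

```python
from collections import deque


def component(graph, seed):
    """All words reachable from `seed` through synonym links."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        for neighbour in graph.get(queue.popleft(), ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical toy graph: every link here is invented for illustration.
TOY_GRAPH = {
    "good": {"sound"},
    "sound": {"good", "strong"},
    "strong": {"sound", "active"},
    "active": {"strong"},
    "purple": {"violet"},
    "violet": {"purple"},
}

cluster = component(TOY_GRAPH, "good")
print(cluster == component(TOY_GRAPH, "strong") == component(TOY_GRAPH, "active"))  # True
print("purple" in cluster)  # False: a separate component
```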

This cluster of words appears to have a special status: it contains all the important modifiers used to 
express emotive or affective meaning; to use our slogan, these are “words with attitude.” Although 
the three measures use the same set of words, the distribution of weights is radically different. The 
weights for each of the three measures are calculated from different reference words, giving rise to different 
minimal path-lengths, and thus different values. For example, we find that 

• EVA* (proper) = 1.00; 

• POT* (proper) = 0.50; and 

• ACT*(proper) = 0.09. 

Our future research is to evaluate the measures of this paper, and refinements decorated with polynomial 
constants. Ideally, this should be done on a test corpus that has been rated on the affective or 
emotive meaning expressed in the texts. So far we have been unable to locate such a corpus, and are 
investigating ways to construct one ourselves. We have also done initial tests on texts found on Internet 
discussion sites (without taking into account negation, nor parsing the texts to determine syntactic 
categories of words). The first observation on this small test set is that there is a correspondence 
between the measures and the meaning expressed. On the one hand, the measures are not flawless 
when considering individual texts. This is hardly surprising since sometimes none or very few of the 
special adjectives occur in these short texts. On the other hand, however, over larger sets of texts the 
measure gives a much better impression (i.e., the false positives and false negatives seem to cancel 
each other out). We need extensive empirical tests in order to qualify what the particular value means (i.e., 
can we distinguish degrees of goodness instead of more coarse-grained distinctions). Since scores increase 
with the length of a text, it is clear that some normalization for the length of a text is needed for 
considering the value to indicate the degree of goodness. Another initial observation is that, although 
the set of words is well-balanced between the opposing sides, there appears to be a bias towards the 
good-side of the evaluative factor. That is, there seems to be a tendency to expound negative judgments 
more concisely than positive judgments. The existence of an asymmetry between positive and negative 
deviations is also known from judgments under uncertainty; think of prospect theory (Tversky and 
Kahneman, 1974). If a similar bias in positive word-choice exists, we can easily compensate for it by 
the relative weight we assign to words. 
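
The paper leaves the choice of length normalization open. As one possible option, and not a method proposed by the authors, the summed score could be divided by the number of words in the text. The sketch below assumes a hypothetical two-entry score table and shows that repeating a text leaves the normalized value unchanged while the raw sum grows.

```python
# Hypothetical two-entry score table, for illustration only.
SCORES = {"good": 1.0, "bad": -1.0}


def raw_score(words, scores):
    """Unnormalized bag-of-words score: grows with the length of the text."""
    return sum(scores.get(word, 0.0) for word in words)


def normalized_score(words, scores):
    """Assumed normalization: mean score per word, 0.0 for an empty text."""
    if not words:
        return 0.0
    return raw_score(words, scores) / len(words)


short_text = ["good", "plan"]
long_text = short_text * 4  # the same content repeated four times

print(raw_score(short_text, SCORES), raw_score(long_text, SCORES))                # 1.0 4.0
print(normalized_score(short_text, SCORES), normalized_score(long_text, SCORES))  # 0.5 0.5
```

Dividing by the number of scored adjectives instead of all words would be another candidate; which choice tracks degrees of goodness better is an empirical question.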

5 Conclusions and Discussion 

In this paper, we investigated measures for affective or emotive aspects of meaning derived from the 
structure of the WordNet lexical database. Such a project presupposes that subjective aspects of meaning 
can be derived from the choice of words in a text. That is, there are indeed words with attitude or 
values. This is not undisputed: some philosophers have been skeptical whether different people’s words 
can mean the same (Quine, 1960).⁸ Our focus on texts, rather than other modes of communication, 
gives some confidence that certain aspects of the expressed meaning can be derived from the particular 
word choice. One of the types of texts we consider interesting are texts on Internet discussion sites: 
here there is a strong incentive for the writer to make sure that a reader can grasp the intended meaning 
from the textual content. Even the mere existence of such discussion sites can be viewed as evidence 
that this is the case. 

Mainstream research in natural language processing deemphasizes more subjective aspects of meaning 
(Manning and Schütze, 1999; Jurafsky and Martin, 2000). Our work can be viewed as an attempt 
to rectify this. A consequence of going beyond the established notion of lexical meaning is that there 
is no consensus on notions of affective or emotive meaning. So it is not immediately clear what notions 
to use. We decided to go back to one of the seminal works on measuring affective meaning, 
Osgood et al. (1957)’s semantic differential technique. From this, we took some of the most important 
factors of affective meaning, the evaluative, potency, and activity factors. The second crucial ingredient 
is our use of the WordNet lexical database (Miller, 1990; Fellbaum, 1998). The basic notion of meaning 
used in WordNet is lexical meaning, and WordNet’s main SYNSET relation denotes coincidence of 
lexical meaning. However, it is important to stress that WordNet is partly inspired by psycholinguistic 
theories of human lexical memory. That is, the meaning of a word is also determined by its place in the 
larger structure of the database. Also note that this larger structure shows some resemblance to our 
own lexical memory. In this paper, we have translated this structure back into concrete measures for 
the Osgood factors of meaning. All three resulting measures single out the same cluster of 5410 
adjectives, which is 25% of the adjectives in WordNet. This cluster appears to have a special status: 
it contains all the important modifiers used to express affective or emotive meaning. These are words 
with attitude. 

As it turned out, the measures we constructed are based on a distance metric. This relates our work 
to the ubiquity of measures of distance, similarity, or relatedness in natural language processing. To 
name a few, the use of path-length as a measure of similarity can be traced to (Quillian, 1968). The 
use of path-length as a similarity metric is also discussed in (Rada et al., 1989). A recent evaluation of five 
distance measures can be found in (Budanitsky and Hirst, 2001). Most measures of relatedness use more 
than just the synonymy relation. For our purposes, this is not useful because it destroys the bipolarity 
of the concepts we are interested in. For example, all our pairs of adjectives are directly related by the 
antonymy relation, and one may suspect a close common hypernym. Although there is similarity with 
the traditional distance measures used in NLP, it is important to stress that we use these measures for 
different purposes. Already Quillian (1968, p.228) has it that 

8 This reminds of (Carroll, 1871, Chapter 6): 

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to 
mean—neither more nor less.” 

“The question is,” said Alice, “whether you can make words mean so many different things.” 

“The question is,” said Humpty Dumpty, "which is to be master—that’s all.” 





One issue facing the investigator of semantic memory, is: exactly what is it about word meanings that is to be 
considered? First, the memory model here is designed to deal with exactly complementary kinds of meaning 
to that involved in Osgood’s “semantic differential” (Osgood et al., 1957). While the semantic differential is 
concerned with people’s feelings in regard to words, or the words possible emotive impact on others, this model 
is explicitly designed to represent the nonemotive, relatively “objective” part of meaning. 

We have shown in this paper how a measure for the affective meaning studied by Osgood can be derived 
from a representation of the relatively “objective” meaning as represented in the WordNet database. 

Acknowledgments This research was supported by the Netherlands Organization for Scientific Research 
(NWO, grants # 400-20-036 and # 612-62-001). Thanks to Patrick Blackburn, Michael Masuch 
and Ivar Vermeulen for their comments, and to Rob Mokken for suggesting Osgood’s semantic differential 
technique. All data is derived from Princeton WordNet 1.7, using Perl and Dan Brian’s excellent 
Lingua::Wordnet module. 

Institute for Logic, Language and Computation 
University of Amsterdam 
Nieuwe Achtergracht 166 
1018 WV Amsterdam 
the Netherlands 

{kamps,marx}@illc.uva.nl 

References 

Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, 
application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources. Second meeting 
of the North American Chapter of the Association for Computational Linguistics, Pittsburgh. 

Lewis Carroll. 1871. Through the looking-glass, and what Alice found there. Macmillan, London. 

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database, Language, Speech, and
Communication Series. The MIT Press, Cambridge MA.

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural 
Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, Upper Saddle River 
NJ. 

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing.
The MIT Press, Cambridge MA. 

Stanley Milgram. 1967. The small world problem. Psychology Today, 2:61-67.

George A. Miller. 1990. WordNet: An on-line lexical database. International Journal of Lexicography, 
3(4):235—312. Special Issue. 

Charles E. Osgood, George J. Succi, and Percy H. Tannenbaum. 1957. The Measurement of Meaning. University 
of Illinois Press, Urbana IL. 

M. Ross Quillian. 1968. Semantic memory. In Marvin Minsky, editor, Semantic Information Processing,
chapter 4, pages 227-270. The MIT Press, Cambridge MA.

Willard Van Orman Quine. 1960. Word and Object. The MIT Press, Cambridge MA. 

Roy Rada, Hafedh Mili, Ellen Bicknell, and Maria Blettner. 1989. Development and application of a metric on 
semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19:17-30. 

Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science, 
185:1124-1131. 

Piek Vossen. 1998. Special issue on EuroWordNet. Computers and the Humanities, 32(2/3):73-251. 




Metaphoric Expressions: An Analysis of Data from a Corpus 
and the ItalWordNet Database* 


Antonietta Alonge 
Margherita Castelli 


Abstract 

This paper reports on the work carried out so far, in the context of the ISLE project, to envisage if and 
how information on metaphoric expressions should be encoded in a (multilingual) lexical entry for 
NLP applications. When analysing corpus data we find a huge number of metaphoric expressions 
which can be hardly dealt with by using as reference databases resources already developed. Thus, the 
problem we are facing is to what extent and how information on metaphors could be encoded in a 
general lexicon. In this paper we address the narrower issue of what is encoded on metaphoric 
expressions in a WordNet-like resource - ItalWordNet - and what could instead be encoded and how. 
We explore the issue by taking into account the occurrences of the verb colpire (to hit/to strike/to 
shoot) and the noun colpo (blow/stroke/shot) in a corpus of Italian and in the ItalWordNet database. 

1. Introduction 

Since the influential work by Lakoff and Johnson (1980) much research has been devoted to the issue 
of dealing with metaphoric expressions in text, considering metaphor not just as a poetical way of 
speaking, rather as something which is deeply embedded in our language, culture and the way we 
think. Metaphor affects how we experience and interact with the world and other people: “the human 
conceptual system is metaphorically structured and defined. Metaphors as linguistic expressions are 
possible precisely because there are metaphors in a person’s conceptual system.” (Lakoff and 
Johnson, 1980: 6). We use a limited set of image-schemas to build not only basic concepts, but also
concepts which are not directly linked to physical experience. Thus, metaphoric linguistic expressions 
are manifestations of ‘conceptual metaphors’, i.e. metaphorical structures which are present in our 
mind and relate a concrete source domain with a more abstract target domain. In other words, 
metaphoric linguistic expressions are the superficial realization of the mapping we perform between 
conceptual domains. At least two consequences follow from this perspective which should be 
considered when building a lexicon: 

i) metaphorical extension of word senses is a kind of regular polysemy (cf. Apresjan, 1980): e.g., He
arrived (‘came here’ or ‘was born’) when we were 20; He left us (‘went away’ or ‘died’) after
some time;

ii) generalizations govern inference models, i.e. those cases in which an inference model from a
certain conceptual domain is used in another domain: e.g., In our relationship we have faced
many obstacles → It has been difficult to go ahead.

The problem we are facing, in the context of the ISLE (International Standards for Language 
Engineering) project 2 , is to what extent information on metaphors should be encoded in a 
multilingual lexicon. In this paper we address the narrower issue of what is encoded on metaphoric 
expressions in a WordNet-like resource — ItalWordNet 13 (IWN) — and what could instead be encoded 
and how. 

The present paper is the outcome of a collaborative effort. For the specific concerns of the Italian Academy only, A. 
Alonge is responsible for sections 2 and 4; M. Castelli for sections 1 and 3. 

2 The ISLE project is the continuation of the EAGLES initiative (Expert Advisory Group for Language Engineering
Standards), which has seen successful development and broad deployment of a number of recommendations and de facto
standards. ISLE aims at developing Human Language Technology standards within an international framework, in the
context of the EU-US International Research Cooperation initiative.

13 ItalWordNet has been developed within the SI-TAL (Integrated System for the Automatic Treatment of Language) Italian 
project, devoted to the creation of large linguistic resources and software tools for the Italian written and spoken language 
processing. The database has been built by extending the Italian wordnet realized within the EuroWordNet project, but has 



There has been much research devoted to the issue of what a word sense is, and whether word senses ‘exist’
at all and should be considered the basic units of the lexicon. Although we agree with views
according to which “word senses exist only relative to a task” (Kilgarriff, 1997: 1), and at the
same time find proposals for ‘coarse coding’ (Harris, 1994) 14 appealing, we still believe that a WN-like
structure, taking the concepts and the synsets referring to them as the ‘building blocks’ of the (mental)
lexicon, is both appropriate as a representation of lexical knowledge (with the basic idea of a net
linking the concepts) and usable as a resource for NLP, provided that the possible uses and
actual limits of such a resource are kept clear.

2. Metaphoric use of Language and Lexical Resources 

Sense distinctions vary widely across lexical resources. Different dictionaries distinguish among 
different senses of words in a sort of arbitrary way since they are strongly influenced by the purpose 
of the resource (the target audience), and have different editorial philosophies with respect to 
‘lumping vs. splitting’ of senses (see, e.g., Atkins, 1993; Kilgarriff, 1997). Dictionaries normally 
contain distinctions among ‘literal’ vs. ‘figurative’ meaning within a lexical entry. However, such 
information is in general, at best, ‘incomplete’: 

i) information on metaphoric uses is not systematic in many sources, and different sources contain 
different information; 

ii) potential metaphors are not encoded; 

iii) when information on metaphorical sense extensions is present, there is generally no clear 
indication of the connection between the ‘basic’ and the ‘extended’ senses. 

EuroWordNet (EWN) first and then IWN were built using dictionaries available in
machine-readable form as source data; thus they contain inconsistencies and gaps inherited from
those dictionaries. While building first the Italian wordnet in EWN and then IWN, the problem of finding a
coherent and principled way to deal with metaphoric extensions of word senses was always present
but somewhat neglected, mainly because of the time limits of the projects. However, in order to
deal (automatically) with real texts the issue has to be tackled by identifying what should be
encoded and how.

Consider, for instance, the verb lasciare (to leave). It can occur in sentences 1. and 2. below: 

1. Ci ha lasciati quando aveva solo 20 anni. 

(He left us when he was only 20) 

2. Ci ha lasciati. 

(He left us) 

Possible interpretations of 1. can be either ‘He left us to go somewhere’ or ‘He died’; 2. seems more 
immediately interpreted as ‘He died’. The conceptual metaphors implied by this use of the verb are: 
‘BIRTH IS ARRIVAL’, ‘LIFE IS BEING PRESENT HERE’, ‘DEATH IS DEPARTURE’. This
metaphoric use of the verb is very common; however, such a sense is not distinguished either in
IWN or in the dictionaries used to build it.

A second example illustrates the problem of the lack of connections between basic and
metaphorical, extended senses. The Italian verb separarsi (to separate, to divide) has a metaphorical
sense extension clearly encoded in sources. Within IWN we find, among others, the following
synsets containing the verb:

{separarsi 1, staccarsi 3, dividersi 2} = allontanarsi da qualcuno (to separate, to go far from someone) 


inherited from EuroWordNet its main characteristics (general structure, relations, etc.; see the EuroWordNet website for
complete information on the project and its results: http://www.hum.uva.nl/~ewn/gwa.htm; see Alonge et al., 2000 and
Roventini et al., forthcoming, for a detailed description of the IWN database).

14 As she states it, the ‘short form’ of Harris’s proposal is “that both language form and meaning are stored in chunks larger 
than a word, and that therefore the meaning of words is usually tightly linked to their typical contexts of occurrence’’ (p. 12). 






{separarsi 4, lasciarsi 1, dividersi 4} = rompere un legame con qualcuno, specialmente di coppia (to
divide, to break a relationship).

Since enough information is encoded in IWN on the two senses of the verb, the database could be 
used to disambiguate it in the following sentences: 

3. Si separarono dopo una lunga camminata. 

(They separated after a long walk) 

4. Si separarono dopo un lungo periodo di crisi. 

(They separated after a long period of crisis). 

The metaphor involved in this case is that according to which ‘LOVE IS A JOURNEY’ and the sense 
extension seen is a case of regular polysemy since a whole set of expressions used to refer to motion 
can be used to refer to love relationships. Thus, we could say that a man and a woman have started 
walking together (when they start a love relationship) and then they have been stopped by some 
problem, etc., using already attested metaphoric expressions, but also creating new ones. In IWN there 
is no indication of the connection of these two senses of separarsi , while it could be useful to have 
information on the existence of such a regularity of polysemy, in order to deal with novel metaphoric 
expressions involving verbs, or also other parts of speech, referring to the same basic conceptual 
domain. 
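
The missing connection discussed above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the IWN data model: the `extended_from` field, the coarse `domain` labels and the `ConceptualMetaphor` record are hypothetical names introduced here; only the two synsets of separarsi and the LOVE IS A JOURNEY metaphor come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Synset:
    variants: List[str]
    gloss: str
    domain: str                          # assumed coarse conceptual-domain label
    extended_from: Optional[int] = None  # hypothetical link: index of the basic sense


@dataclass
class ConceptualMetaphor:
    name: str
    source_domain: str
    target_domain: str


# The two IWN synsets quoted in the text; the domain labels are assumptions.
SYNSETS = [
    Synset(["separarsi 1", "staccarsi 3", "dividersi 2"],
           "allontanarsi da qualcuno", "motion"),
    Synset(["separarsi 4", "lasciarsi 1", "dividersi 4"],
           "rompere un legame con qualcuno", "love_relationship",
           extended_from=0),
]

METAPHORS = [ConceptualMetaphor("LOVE IS A JOURNEY", "motion", "love_relationship")]


def licensing_metaphor(extended, synsets, metaphors):
    """Name the metaphor relating an extended sense to its basic sense, if any."""
    if extended.extended_from is None:
        return None
    basic = synsets[extended.extended_from]
    for m in metaphors:
        if m.source_domain == basic.domain and m.target_domain == extended.domain:
            return m.name
    return None


def licensed_targets(synset, metaphors):
    """Target domains in which a novel metaphoric use of this sense is predicted."""
    return [m.target_domain for m in metaphors if m.source_domain == synset.domain]
```

With such a link, any other synset labelled with the motion domain would be predicted to allow a reading in the domain of love relationships, which is the kind of regularity the database currently does not record.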

When comparing corpus occurrences of words with information encoded in IWN, or in other
lexical resources, one normally sees that there is a surprisingly high frequency of figurative senses in
real texts and that most of these senses are not described in such resources (cf. Nimb and Sanford
Pedersen, 2000, for data identified within the SIMPLE EC project and the solutions proposed in that
context). The question is: how should these figurative senses be accounted for in a WN-like resource
(in particular, in IWN)? And: how should novel, potential uses of words be dealt with by referring to a
resource such as IWN? In the next section we illustrate the problem in more detail by analysing corpus
occurrences of the verb colpire (to hit/to strike/to shoot) and the noun colpo (blow/stroke/shot), which
have been taken into consideration within ISLE to explore all the issues connected with the encoding
of data within a multilingual lexical entry. In the concluding section we try to make some proposals.

3. Analysis of colpire and colpo 

In going through corpus occurrences of colpo and colpire our goal was to identify metaphoric
uses in order to find possible ways of representing this information in a multilingual lexicon. Here we
narrow our aim to proposing improvements for a resource such as IWN.

The steps followed in the research we are discussing here have been: 

i) identification of the senses of colpo and colpire at monolingual level in IWN; 

ii) for each identified sense, analysis and survey of part of the collocations extracted from the Italian 
corpus, and comparison with data encoded in IWN. 

3.1. Analysis of colpire 

A complete analysis has been carried out for the verb occurrences by taking into account various 
aspects of the verb semantics (its Aktionsart/Tense/Aspect; arguments; etc.). However, the elements 
which turned out to be more relevant to emphasize the differences among the senses are: 

i) the verb semantic preferences 

ii) the kind of effect produced - positive or negative (up or down in Lakoff and Johnson (1980) 
terms) - by the event referred to. 





The following are the IWN synsets containing the verb: 

1. {colpire} - dare un colpo (to hit)

2. {colpire} - danneggiare (fig.) (to damage)

3. {raggiungere, cogliere, prendere, colpire} - cogliere, raggiungere (detto di persona o di cosa) (to
hit the target)

4. {stressare, traumatizzare, sconvolgere, impressionare, colpire, scioccare} - impressionare (to
strike, to impress)

By clustering the concordances found for colpire in the corpus, we found a ‘gradation’ of
well-established uses (identifying specific senses of the verb), going from a more basic ‘literal’
sense to a more ‘figurative’ one and passing through different levels of ‘transparency’ of metaphor. For each
metaphoric use we tried to identify a general cognitive metaphor to which groups of metaphoric
expressions can be linked.

1) “Le azioni sono armi” (Actions are weapons)

ARG1: [+animate]; ARG2: [+animate]
effect: strictly physical damage (negative)

il <mostro> *colpì* per l'ultima volta
(the ‘monster’ hit for the last time)

Per 15 anni, *colpì* persone indifese nei caffè, sui treni, nelle stazioni
(For 15 years, he hit defenceless people in bars, on trains, in stations)

[Satana] *colpì* Giobbe con un'infezione maligna
(Satan struck Job with a malign infection)

This sense might be linked to IWN sense 2; however, the IWN synset is more general and could be
considered a hyperonym of this sense.

2) “I provvedimenti sociali/politici sono un’arma” (Social/political measures are weapons)

ARG1: [±animate] - can be the measure or the people adopting it; ARG2: [±animate] - can be the
people affected or some characteristics of them
effect: moral/economic damage (negative)

Negli anni 80, da giudice istruttore, *colpì* duramente le bande dei sequestratori
(In the 80s, as an investigating judge, he hit the gangs of kidnappers hard)

del Governo Amato che, con la minimum tax, *colpì* duramente il lavoro autonomo
(of the Amato government which, with the minimum tax, hit the self-employed hard)

l'enciclica *colpirà* soprattutto il clero
(the encyclical will strike especially the clergy)

Again this sense might be linked to IWN sense 2, which is however more general and underspecified.

3) “I fenomeni naturali (gli eventi incontrollabili/le malattie) sono un’arma” (Natural
phenomena - uncontrollable events/illnesses - are weapons)

ARG1: [-animate]; ARG2: [±animate] - can be people or a place
effect: physical damage (negative)

Nel luglio 1995 una eccezionale ondata di caldo *colpì* gli Usa, Chicago in particolare
(In July 1995 an exceptional heat wave struck the U.S.A., in particular Chicago)

nel 1989 un terremoto *colpì* l'area di Tipiza,
(in 1989 an earthquake struck the area of Tipiza)

L'Alzheimer *colpì* Reagan quando era presidente
(Alzheimer's struck Reagan when he was president)

Here again we see a sense which has no direct correspondent in IWN, but could be a hyponym of
sense 2.

4) “Il corpo/La mente è uno strumento” (The body/the mind is an instrument)

“Le parti del corpo sono strumenti” (The parts of the body are instruments)

“Le parole/il parlare sono/è strumenti/uno strumento” (Speaking is an instrument - this
metaphor is linked to the already attested metaphors: ‘An argument is a war’ (where the 
instrument is ‘negative’); ‘Communication is the act of sending something’ (positive if 
underspecified)) 

“Le idee sono uno strumento” (Ideas are instruments) 

“I sentimenti sono uno strumento” (Feelings are instruments) 

ARG1: [-animate]; ARG2: [±animate] - can be people or a place 

effect: psychological; effects can be positive or negative depending (on the judgement) on the 
positivity/negativity of the ‘object’ striking. 

Il suo viso mi *colpì*: i suoi capelli erano un po' troppo neri
(His/her face struck me: his/her hair was a little too black)

La prima cosa che mi *colpì* di lei fu che emanava femminilità
(The first thing that struck me about her was that she exuded femininity)

Mi *colpì* la sua pelle, ancora più bianca della mia
(I was struck by her skin, even whiter than mine)

This sense of the verb may be expressed also by means of the phrase fare colpo which always seems 
to refer to a positive impression: 

L'idea *fa colpo*. 

(The idea is impressive) 

scriveva poesie, <magari per *far colpo* sulle ragazze>

(he used to write poems, “to try to impress girls”)

This sense may be linked to IWN sense 4. 
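
The four groups above can be summarised as feature profiles. The following is a minimal sketch, assuming that animacy is a simple boolean and that the effect type is one of a few fixed labels; the names `SenseProfile`, `PROFILES` and `candidate_groups` are hypothetical, and two values are readings of unclear features in the source: ARG1 of group 1 is taken as [+animate] on the basis of the examples, and ARG2 of group 2 as [±animate].

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Allowed animacy values for an argument slot.
ANIMATE = frozenset({True})
INANIMATE = frozenset({False})
EITHER = frozenset({True, False})


@dataclass(frozen=True)
class SenseProfile:
    group: int             # number of the metaphor group in section 3.1
    metaphor: str
    arg1: FrozenSet[bool]  # allowed animacy of ARG1
    arg2: FrozenSet[bool]  # allowed animacy of ARG2
    effect: str            # kind of effect produced
    polarity: str          # "negative" or "either"
    iwn_sense: int         # IWN sense the group was related to in the text


PROFILES = [
    # ARG1 of group 1 is an assumption based on the corpus examples.
    SenseProfile(1, "Actions are weapons", ANIMATE, ANIMATE,
                 "physical", "negative", 2),
    # ARG2 of group 2 is an assumption (people or their characteristics).
    SenseProfile(2, "Social/political measures are weapons", EITHER, EITHER,
                 "moral/economic", "negative", 2),
    SenseProfile(3, "Natural phenomena are weapons", INANIMATE, EITHER,
                 "physical", "negative", 2),
    SenseProfile(4, "The body/the mind is an instrument", INANIMATE, EITHER,
                 "psychological", "either", 4),
]


def candidate_groups(arg1_animate: bool, arg2_animate: bool, effect: str) -> List[int]:
    """Return the metaphor groups compatible with the observed features."""
    return [p.group for p in PROFILES
            if arg1_animate in p.arg1
            and arg2_animate in p.arg2
            and p.effect == effect]
```

For instance, an inanimate subject causing physical damage to a place (the earthquake example) is compatible only with group 3, while an inanimate subject with a psychological effect on a person (the face example) is compatible only with group 4.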

3.2. Analysis of colpo 

In IWN the following senses are distinguished for colpo: 

1. {urto, percossa, colpo} - colpo che si dà a qlcu. o a qualcosa, con la mano o con altro (blow,
stroke etc., dealt with an instrument, a hand or something else) 

2. {colpo} - azione rapida o violenta tipica dello sport (sudden action, in sport) 

3. {sparata, fuoco, tiro, sparo, colpo} - ogni colpo di arma da fuoco (shot) 

4. {colpo, accidente, malore} - indisposizione, male fisico improvviso (sudden illness)

5. {colpo} — un grave danno (serious damage) 

6. {colpo} - i pugni nella box (punch, blow, in boxing) 

7. {infarto, colpo, attacco cardiaco} - attacco cardiaco (heart attack) 

We have analyzed the corpus trying first to verify whether the cognitive metaphors identified for the verb
senses can also be extended to the noun. In what follows, however, we discuss only the
most relevant facts that emerged from the analysis, leaving out many multiword expressions
whose detailed discussion would go beyond the scope of this paper.




Cognitive metaphors found for the verb: 

1) Le azioni sono armi (Actions are weapons) 

A metaphoric use which could be linked to this general metaphor is not present in the corpus, where
we pass from more ‘literal’ uses to metaphoric expressions which can be linked to the 2nd group of
cognitive metaphors seen above.

2) “I provvedimenti sociali/politici sono un’arma” (Social/political measures are weapons)

effect: moral/economic damage (negative)

*colpo* inferto all'inchiesta
(blow dealt to the inquiry)

Shamir ha assestato ieri il suo, decisivo, *colpo* alla legislatura
(Shamir dealt his decisive blow to the legislature yesterday)

L'ultimo <*colpo*> dei giudici sembra aver lasciato L'Aquila quasi indifferente
(the latest blow from the judges seems to have left L’Aquila almost indifferent)

We also find fixed multiword expressions connected with this sense:

Banche, *colpo* di scure sugli utili
(Banks, ‘axe blow’ on the profits)

Nessun *colpo* di spugna ma neanche una dissipazione della politica
(No sponge passed over but neither a dissipation of politics)

No such sense is clearly distinguished in IWN; however, sense 5 could be considered a
hyperonym of it.

3) “I fenomeni naturali (gli eventi incontrollabili/le malattie) sono un’arma” (Natural
phenomena - uncontrollable events/illnesses - are weapons)

effect: physical damage (negative)

pare un rospo che ha avuto un *colpetto* ed è rimasto paralizzato
(it looks like a toad that has had a small stroke and been left paralysed)

Examples of this sense of colpo are not very frequent in the corpus, except within fixed multiword
expressions:

per evitare incidenti da *colpo* di calore
(to avoid accidents due to heatstroke)

L'eritema cutaneo è il classico *colpo* di sole che compare nelle ore successive a una irradiazione
(Skin erythema is the typical sunstroke which appears in the hours following irradiation)

Poi il *colpo* di sonno e la sciagura
(Then the ‘sleep stroke’ and the disaster)
etc.

This sense could be linked to sense 4 in IWN.

4) “Il corpo/La mente è uno strumento” (The body/the mind is an instrument)

“Le parti del corpo sono strumenti” (The parts of the body are instruments)

“Le parole/il parlare sono/è strumenti/uno strumento” (Speaking is an instrument - linked to
the already attested metaphors: An argument is a war (negative instrument); Communication is
the act of sending something (positive if underspecified))

“Le idee sono uno strumento” (Ideas are instruments)

“I sentimenti sono uno strumento” (Feelings are instruments)

effect: psychological; effects can be positive or negative depending (on the judgement) on the
positivity/negativity of the ‘object’ striking.

Examples of this metaphor for colpo do not seem to be present in the corpus. Actually, there are a few cases
which are ambiguous between this reading and (2) above.

A sense which is very frequent in the corpus is that of {colpo, rapina} = robbery, job, heist:

*Colpo* da cento milioni all'ufficio postale
(Heist of a hundred million at the post office)

Gaetano Lesi, 39 anni, si è accorto del *colpo* ieri mattina, quando ha trovato la cassaforte aperta

(Gaetano Lesi, 39 years old, became aware of the heist yesterday morning, when he found the
strongbox open)

This also has an extended meaning:

*Colpo* grosso per la Bugatti
(Masterstroke for Bugatti)

*Colpo* a sorpresa della Citroen
(Surprise stroke by Citroen)

etc.

Strangely, this sense is missing from IWN and from its source dictionaries, while it is encoded in other
dictionaries of Italian (for instance, Zingarelli).

4. Conclusions 

Information to be encoded at the synset level and generalizations at a higher level 

As is clear, IWN lacks precise information on very frequent metaphoric uses of colpire and colpo.
By clustering corpus occurrences extracted from a general corpus of Italian it is possible to identify 
senses which could be added to the database to provide both a better account of a speaker’s lexical 
knowledge and a set of data which are useful for various NLP tasks. Indeed, the data provided show 
that by analyzing a large general corpus various metaphoric expressions are clearly distinguishable 
which are not (consistently) identified in IWN or other resources. Since the necessity of adding 
corpora as sources for lexicons is probably undebatable, our main point is that one should deal with 
these issues by adopting a well established and generally accepted theoretical framework like that 
proposed by Lakoff and Johnson (1980), within which a large system of conventional conceptual 
metaphors has been recognized. By adopting that perspective many subtle, but relevant, differences 
may be highlighted in a principled way. These should be encoded at the synset level to account for 
already well established word senses. Of course, no lexical resource will probably ever be able to 
exhaustively account for the phenomenon which Cruse (1986) termed modulation , determining that “a 
single sense can be modified in an unlimited number of ways for different contexts, each context 
emphasizing certain semantic traits, and obscuring and suppressing others” (Cruse, 1986: 52). 
However, each resource should be designed so as to be as complete and coherent as possible.

What remains to be explored further is the issue of how to encode information on the systematicity of
conceptual metaphors. When we understand novel metaphoric expressions we make reference to a
system of established mappings between concrete conceptual domains and abstract ones (e.g., the
above-mentioned mapping between the domain of journeys and that of love relationships). That is, there
is pre-existing knowledge which constrains our ability to produce and/or understand novel
metaphoric expressions. In order to build a resource which actually accounts for our lexical-conceptual
knowledge, we also have to find a way to encode knowledge about possible mappings
between conceptual domains that result in the production of metaphoric expressions. This information should
be encoded at a higher level than the synset level, since it is information on regular polysemy
affecting whole conceptual domains.

In IWN, as in EWN, we have three fundamental levels of representation of semantic information: 

i) the synset level, where language-specific synsets information is encoded; 

ii) the level of the linking to the Interlingual-Index (ILI - an unstructured list of WN 1.5 synsets) to 
which synsets from the specific wordnet point in order to perform the linking between different 
language-specific wordnets; 

iii) the Top Ontology (TO), a hierarchy of language-independent concepts, reflecting fundamental 
semantic distinctions, which may (or may not) be lexicalised in various ways, or according to 
different patterns, in different languages: via the ILI, all the concepts in the language specific 
wordnet are directly or indirectly (via hyponymy relations) linked to the TO.

Since the distinctions at the level of the TO are language independent, it is necessary to show 
metaphoric regular polysemy found in a specific language at a different level. Indeed, there are 
culture-constrained differences in the metaphor system (see, e.g., the differences linked to orientation 
reported by Lakoff and Johnson, 1980, determining for instance that in some cultures the future is in 
front of us and in others the future is behind us) which should receive a representation at some other 
level.

In EWN some cases of regular polysemy were dealt with at the level of the linking of each
language-specific wordnet with the ILI. Via the ILI the generalizations over concepts were projected to the TO.
Generalizations were stated directly at the level of the ILI and automatically inherited by all the
synsets which in a language-specific wordnet were linked to the ILIs involved in the generalizations
themselves. An automatically added generalization could later be manually deleted in case it did not
apply to a specific language (cf. Peters et al., 1998). For instance, the lexeme scuola (school) in Italian
has (among others) two senses, one indicating the institution and the other the building. This is a
case of regular polysemy, since many words indicating institutions also indicate buildings in Italian
(as, of course, in other languages). Once we linked the school-institution and the school-building
synsets to the appropriate synsets in the ILI, the system automatically added to each Italian synset
another equivalence link, called EQ_METONYM, to a kind of ‘composite ILI’, clustering the
‘institution’ and ‘building’ ILIs into a coarser-grained sense group. Thus, our synsets, via the ILI,
were linked to tops in the TO indicating concepts in different domains. A similar operation was
automatically performed for senses reflecting diathesis alternations for verbs (related by
EQ_DIATHESIS), such as causative and inchoative pairs. Where a kind of regular polysemy was not
found in our language, we had to manually delete the automatically generated link to the relevant
composite ILI.
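
The inheritance and manual deletion of composite ILI links described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the EWN software: all identifiers (`ili:school-institution`, `composite:institution+building`, `scuola_1`, etc.) and the function names are hypothetical; only the relation name EQ_METONYM and the scuola example come from the text.

```python
from collections import defaultdict

# Hypothetical composite ILI record clustering two ILI records.
COMPOSITE_ILIS = {
    "composite:institution+building": {
        "relation": "EQ_METONYM",
        "members": {"ili:school-institution", "ili:school-building"},
    },
}


def add_composite_links(eq_links, composites):
    """Give every synset linked to a member ILI a link to the composite ILI.

    eq_links maps a language-specific synset id to the set of ILI ids it is
    linked to; the result maps synset ids to (relation, composite id) pairs.
    """
    inherited = defaultdict(set)
    for synset, ilis in eq_links.items():
        for comp_id, comp in composites.items():
            if ilis & comp["members"]:
                inherited[synset].add((comp["relation"], comp_id))
    return dict(inherited)


def delete_link(inherited, synset, comp_id):
    """Manually remove an inherited link that does not apply to the language."""
    remaining = {link for link in inherited.get(synset, set())
                 if link[1] != comp_id}
    if remaining:
        inherited[synset] = remaining
    else:
        inherited.pop(synset, None)


# Hypothetical Italian equivalence links for the example in the text.
italian_links = {
    "scuola_1": {"ili:school-institution"},
    "scuola_2": {"ili:school-building"},
    "colpo_1": {"ili:blow"},
}
inherited_links = add_composite_links(italian_links, COMPOSITE_ILIS)
```

In this sketch both senses of scuola receive the EQ_METONYM link, while a synset linked to no member of the cluster receives nothing; `delete_link` stands for the manual correction step.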

In IWN the composite ILIs have not been used. However, we think that they could be adopted,
by creating a much larger set of them, to account for regular metaphoric extensions of senses. In order
to deal with culture-constrained differences in the metaphor system, instead of identifying a priori a
set of composite ILIs to be automatically added to a language-specific wordnet (and possibly
deleted later, with various practical problems), it would be better to be able to create for each
language new composite ILIs which could later be shared among languages. Via the ILI links
the connection between specific synsets in a language would also be shown at the TO level as a
connection (mapping) between concepts (conceptual domains).

References 

Alonge A., Bertagna F., Calzolari N., Roventini A., Zampolli A. (2000) Encoding Information on
Adjectives in a Lexical-Semantic Net for Computational Applications. In “Proceedings of the 1st
Conference of the North American Chapter of the Association for Computational Linguistics”,
Seattle, pp. 42-50.

Apresjan J. (1980) Regular Polysemy. Linguistics, 142, The Hague, Mouton.





Atkins, B. (1993) Building a Lexicon: The Contribution of Lexicography, International Journal of 
Lexicography, 3, pp. 167-204. 

Cruse, D. A., (1986) Lexical Semantics. Cambridge University Press, Cambridge. 

Lakoff G. and Johnson M. (1980) Metaphors We Live by. University of Chicago Press, Chicago. 

Harris C. (1994) Coarse Coding and the Lexicon. In “Continuity in Linguistic Semantics”, Fuchs, C. 
and Victorri, B. (eds.) Benjamins, Amsterdam. 

Kilgarriff A. (1997) “I don't believe in word senses”. Computers and the Humanities 31 (2), pp. 91-113.

Nimb S. and Sanford Pedersen B. (2000) Treating Metaphoric Senses in a Danish Computational 
Lexicon - Different Cases of Regular Polysemy. Ms. 

Peters W., Peters I., Vossen P. (1998) Automatic Sense Clustering in EuroWordNet. In “Proceedings 
of the 1st International Conference on Language Resources and Evaluation”, Granada, Spain, pp. 409- 
23. 

Roventini, A., A. Alonge, F. Bertagna, N. Calzolari, J. Cancila, R. Marinelli, A. Zampolli, B.
Magnini, C. Girardi, M. Speranza (forthcoming), ItalWordNet: Building a Large Semantic Database
for the Automatic Treatment of Italian, Linguistica Computazionale, Giardini, Pisa.

Antonietta Alonge, Sezione di Linguistica, Facoltà di Lettere e Filosofia, Università di Perugia, Piazza Morlacchi,
11, Perugia 06100 - ITALY, antoalonge@libero.it

Margherita Castelli, Sezione di Linguistica, Facoltà di Lettere e Filosofia, Università di Perugia, Piazza
Morlacchi, 11, Perugia 06100 - ITALY, castelli@unipg.it





What Does It Mean To Be A Shelf?
Semantic Bleaching and WordNet


Sandiway Fong


Abstract


In English, denominal verbs incorporate in varying degrees the meaning of the 
root noun as part of the verb’s meaning. For example, one can box a present in a 
gift box but not in a paper bag, shelve a book on the mantelpiece but not on a spike. 
Other verbs such as land and warehouse exhibit bleaching to a much greater degree; 
for example, one can land a hydroplane on water, or warehouse parts in a barn, 
silo or any structure. In this paper, we describe the advantages and shortcomings 
in modeling semantic bleaching using WordNet’s hypernym/hyponym hierarchy, 
suggesting, along the way, directions for further refinement of the isa-relation. 


1 Introduction 

WordNet, Fellbaum (1998), provides a rich array of semantic relations that can be exploited
for natural language tasks involving semantic inference. The hypernym/hyponym
hierarchy represents one of these relations defined over nouns. Denominal verbs, being
derived from nominal roots, inherit substantial semantic properties from associated
nouns. In this paper, we show that the structure and organization of WordNet’s noun
hierarchy have empirical consequences for the verbal system with respect to denominals.


1.1 Non-Bleaching Denominals 

In English, nouns can often function as location or locatum verbs, incorporating the 
nominal to a greater or lesser extent as part of the denominal verb’s meaning. Kiparsky 
(1997) uses the term “bleaching” to express the degree of attenuation in nominal meaning. 1 
For instance, consider the examples shown in (1), adapted from (?). 

(1) a. John boxed the present 

b. John put the present in a <box> 

c. John boxed the present in a gift box 

d. # John boxed the present in a brown paper bag 

(2) a. Mary buttered the piece of toast 

b. Mary put <butter> on the piece of toast 

c. Mary buttered the toast with margarine/unsalted butter 

d. # Mary buttered the toast with marmalade/onions 

1 The term semantic bleaching is more conventionally used in the linguistics literature in the context 
of language change. 



The location verb box in (1a) has the informal meaning given in (1b). Here, PUT
represents the underlying core verb, and angle brackets < ... > are used to indicate the
nominal constant that is lexicalized as a verb in the sense of (?). According to Kiparsky,
box permits only a restricted range of possible location adjuncts, as shown in (1c) and
(1d). Similarly, the locatum verb butter in (2a), paraphrased in (2b), limits the choice
of locatum adjuncts to either butter or its direct substitute, margarine, as shown in (2c)
and (2d). 2 3

1.2 Partial Bleaching 

Interestingly, other denominals permit a wider, though still partially restricted, range of
modification, e.g. location shelve and locatum bread, as shown in (3) and (4). 4 5

(3) a. Peter shelved a book 

b. Peter shelved a book on the windowsill/mantelpiece/table/stand 

c. # Peter shelved a book on the ball/spike/ceiling/floor/balcony

(4) a. Sue breaded the fish 

b. Sue breaded the fish with breadcrumbs/shredded coconut/crushed almonds 

c. # Sue breaded the fish with marmalade/butter/treacle/ice 

In (3), the location shelf may be replaced by “shelf-like” objects such as windowsills 
and tables , but not by other objects like spikes or balconies. This partial “bleaching” of 
shelf can be encoded as follows: 

(5) a. x PUT y ON <SHELF>

b. x PUT y ON z & shelf-like-object(z)

More formally, the concept of partial bleaching for shelf involves the replacement of 
the concrete constant <SHELF> with a variable z restricted by the predicate shelf-like- 
object. This paper explores whether and how clustering of semantic relations around shelf 
in a network like WordNet can be used to define a concept such as shelf-like-object. 

Another case of partial bleaching is given in (4) for the locatum verb bread. This example
crucially involves the additional concept of crumbs or small particles, as the contrast
between the examples in (4b) and (4c) indicates. That is, (4a) cannot be paraphrased
using (6a); instead it is more accurately modelled by (6b).

(6) a. # x PUT <BREAD> ON y

b. x PUT crumbs of <BREAD> ON y

c. x PUT crumbs of z ON y

Partial bleaching in this case is encoded by the substitution of <BREAD> with the 
unrestricted variable z, as represented in (6c). 6 

2 Actually a Web search reveals other possible substitutions for gift box including: case, album, container,
trunk, cylinder, carton, crate, casing, coffin, tube, suitcase, slipcase, binder, clamshell, chest, tin
and cabinet.

3 Some readers may find margarine unacceptable in (2c). However, the point here is that butter exhibits 
extremely limited bleaching. 

4 (3b) and (4b) are constructed from actual examples found on the Web. 

5 The notion crumbs of z needs to be further clarified to deal with cases where the relevant entity
already comes in the form of small particles, e.g. pork chops breaded with pumpkin seeds or Boneless
center cuts of pork loin, breaded with cracked black pepper.





<noun.artifact> shelf 

— (a support that consists of a horizontal surface for holding objects) 

=> <noun.artifact> support 

— (any device that bears the weight of another thing) 

Figure 1: WordNet glosses for shelf and support 

1.3 Full Bleaching 

Still, other denominals allow complete bleaching to take place. For example, consider the 
location verbs in (7). 

(7) a. to land a hydroplane on water 

b. to dump garbage by the roadside 

c. to ditch a car in a vacant lot 

d. to warehouse the empty crates in the silo 

The examples in (7) indicate that one can land , dump , ditch or warehouse an object 
or objects anywhere. Here, we use the term “denominal” to encompass what Kiparsky
terms true and apparent denominals. In other words, we assume the verbs in (7) are
semantically related to the corresponding nominals via the template in (8): 

(8) <DENOMINAL> = X PUT y IN/AT location(<NOMINAL>) 

Another possibility is that apparent denominals like dump and ditch may be related to 
nominals via a common root, as suggested in (?). 6 

In a similar fashion, the meaning of locatum verbs blanket and blindfold can be completely
diluted or basically paraphrased as “cover”, as shown in (9) and (10). 7

(9) a. highways blanketed with fog 

b. burgers blanketed with onions 

c. streets blanketed with cars 

d. a steep embankment blanketed with dense foliage 

(10) blindfolded with his own shirt/duct tape/a filthy rag/a teacosy 

To summarize, there appear to be at least three classes or levels of denominals with 
respect to the phenomenon of bleaching. First, denominals like box and butter more or 
less retain “the full force of the corresponding noun”, to use Kiparsky’s words. Second, 
verbs like shelve or bread permit the substitution of shelf -like objects or objects that can 
be broken down into crumbs , respectively. Finally, denominals such as land , dump , ditch 
or blanket and blindfold , allow the nominal meaning to be fully diluted or bleached. 

2 WordNet and Bleaching 

The main question explored in this paper is as follows: 

Can WordNet be used to predict the degree of bleaching for denominals? 

6 In this paper, we are primarily interested in synchronic data. Of course, historically speaking, dump 
and ditch as verbs (but not land or warehouse) pre-date the related nominal forms. 

7 These are actual examples taken from the Web.





As a first stab at the problem, it seems appropriate to make use of the hierarchical
semantic structure represented by WordNet’s hypernym/hyponym relation, which is
designed to encode the isa (is a) or ako (a kind of) relation. In particular, consider the
following hypothesis, shown in (11).

(11) Denominal root Y may be bleached using X if X is a hyponym* of Y 8 

2.1 Example of a Partially Bleaching Verb: Shelve 

Consider again the location verb shelve, as shown in (12) (=3b). 

(12) Peter shelved a book on the windowsill/mantelpiece/table/stand 

In WordNet, shelf as a horizontal support has the following hyponyms: 

(13) bookshelf, hob, mantel, mantelpiece, mantle, chimneypiece, overmantel 

A Web search was performed, revealing the following 9 distinct shelf-like objects and
confirming the limited possibilities for bleaching with respect to shelve: 9

(14) windowsill, mantel, case, radiator, table, stand, carrel, bookstand, bookshelf 

(13) and (14) intersect only for mantel and bookshelf.
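As a rough illustration, hypothesis (11) can be sketched in Python over a toy fragment of the noun hierarchy. The sketch assumes, for simplicity, that every item in (13) is a direct hyponym of shelf; the names HYPERNYM, WEB_SHELVE and hyponyms_star are introduced here for illustration only and are not part of WordNet or of any WordNet library.

```python
# Toy fragment of the noun hierarchy, stored as child -> parent (hypernym).
# Assumption: all items of (13) are treated as direct hyponyms of "shelf".
HYPERNYM = {
    "bookshelf": "shelf", "hob": "shelf", "mantel": "shelf",
    "mantelpiece": "shelf", "mantle": "shelf",
    "chimneypiece": "shelf", "overmantel": "shelf",
}

def hyponyms_star(root, hypernym=HYPERNYM):
    """Reflexive transitive closure: root plus every concept below it."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for child, parent in hypernym.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

# Web data for shelve, copied from (14).
WEB_SHELVE = {"windowsill", "mantel", "case", "radiator", "table",
              "stand", "carrel", "bookstand", "bookshelf"}

predicted = hyponyms_star("shelf")           # candidates licensed by (11)
covered = sorted(predicted & WEB_SHELVE)     # attested and predicted
missed = sorted(WEB_SHELVE - predicted)      # attested but not predicted
```

On this toy data, covered contains only bookshelf and mantel, while the remaining seven attested items fall outside the prediction, which is the gap the revised rule below tries to close.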

To broaden the notion of shelf, note that it is defined in WordNet to be an instance
of the concept support, as shown in Figure 1. 10 WordNet does not make a distinction
between functional and non-functional isa-relations in the hypernym/hyponym hierarchy,
as noted by (?). Here, shelf bears a functional isa-relation with respect to support, i.e. a
shelf functions as a kind of support, to be distinguished from the type of isa-relation that
obtains between, say, bookshelf and shelf. The relevance of the distinction will be made
clear below.

Let us tentatively revise the definition in (11) as follows: 

(15) Denominal root Y may be bleached using X if 

a. X is a hyponym* of Y, or 

b. Z is a functional hypernym+ of Y, and X is a hyponym+ of Z 11, 12
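The revised rule in (15) can be sketched as follows. This is only a minimal illustration under stated assumptions: the functional/non-functional labels on the links are supplied by hand (WordNet does not encode them), the placement of windowsill and stand directly under support is a simplification, and "functional hypernym+" is read here as an unbroken upward chain of functional links. All identifiers are hypothetical.

```python
# child -> (parent, link_is_functional). The labels and the flattened
# placement of windowsill/stand/table are illustrative assumptions.
LINKS = {
    "bookshelf": ("shelf", False),
    "mantel": ("shelf", False),
    "shelf": ("support", True),
    "windowsill": ("support", True),
    "stand": ("support", True),
    "support": ("device", False),
    "device": ("instrumentality", False),
    "table": ("furniture", False),
    "furniture": ("furnishings", False),
    "furnishings": ("instrumentality", False),
}

def hyponyms_plus(node):
    """Transitive closure: every concept strictly below node."""
    below, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for child, (parent, _) in LINKS.items():
            if parent == current and child not in below:
                below.add(child)
                frontier.append(child)
    return below

def functional_hypernyms_plus(node):
    """Climb upwards, stopping at the first non-functional link."""
    chain = []
    while node in LINKS and LINKS[node][1]:
        node = LINKS[node][0]
        chain.append(node)
    return chain

def bleach_candidates(root):
    candidates = {root} | hyponyms_plus(root)      # clause (15a)
    for z in functional_hypernyms_plus(root):      # clause (15b)
        candidates |= hyponyms_plus(z)
    return candidates
```

With these hand-assigned labels, bleach_candidates("shelf") admits windowsill and stand through support but not table, whose only connection to shelf runs through non-functional links.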

Using (15b), we can account for windowsill and (book)stand in (14). A simple search
reveals that these two items are related to shelf via the notion of support , as illustrated 

8 * is used to denote the reflexive transitive closure operation. For the base case when X=Y, other 
criteria also come into play, e.g. the introduction of new or crucial information, as in land a hydroplane 
on dry land. 

9 These were manually extracted from all 850 results returned by Google using the keywords: shelved 
+on. A similar protocol was used in all searches described here. Note that the potential for polysemy 
requires manual intervention to exclude examples such as maps are shelved on the back wall , periodicals 
are shelved on many floors and tiers , and the UN has shelved a US resolution on China. 

10 Although support is the functional superordinate of shelf, WordNet encodes other kinds of superor¬ 
dinate relations. For example, the holonym relation indicates that a shelf can also be part of a bookcase , 
counter, cabinet, closet or bureau. 

11 + represents the transitive closure operation.

12 For the purposes of this paper, we put aside the important problem of how to define functional
hypernymy. This information must be introduced from sources external to WordNet. See also note 14.







(a) (b) 

Figure 3: Paths for shelf and table 


in Figure 2. In fact, Figures 2(a-b) represent the shortest possible link between the 
concepts. 13 

However, (15b) does not allow us to account for table. The shortest path between 
table and shelf is illustrated in Figure 3(a). This path contains higher concepts such 
as device , instrumentality and furnishings that are in no way shelf -like. From another 
viewpoint, complete bleaching in the WordNet hierarchy occurs when non-functional 
relations, such as that between furnishings and instrumentality, are required to complete 
the derivation. Hence, the restriction to functional isa-relations in (15b).
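To make the path argument concrete, the following sketch runs a breadth-first search over a toy hierarchy in which hypernym links can be traversed in both directions. It is not the search engine cited in note 13; the graph is a hand-built assumption that merely contains the concepts named in the text (the intermediate node furniture is assumed), and shortest_path is a hypothetical helper.

```python
from collections import deque

# Toy hierarchy, child -> parent. The intermediate node "furniture" and the
# direct link from "stand" to "support" are assumptions for illustration.
HYPERNYM = {
    "shelf": "support", "stand": "support", "support": "device",
    "device": "instrumentality", "table": "furniture",
    "furniture": "furnishings", "furnishings": "instrumentality",
}

def shortest_path(start, goal, hypernym=HYPERNYM):
    """Breadth-first search treating hypernym links as undirected edges."""
    neighbours = {}
    for child, parent in hypernym.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between the two concepts

path_to_stand = shortest_path("shelf", "stand")
path_to_table = shortest_path("shelf", "table")
```

In this toy graph the path from shelf to stand passes only through support, whereas the path from shelf to table has to climb through device, instrumentality and furnishings, none of which is shelf-like.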

The commonality we seek between table and shelf is that they’re both horizontal or 
flat surfaces capable of support. Given this, there exists a functional relationship between 
the two concepts not currently represented in WordNet. One possible implementation 
is to refine the concept of support to reflect a feature [horizontal] (or flat), as shown 
in Figure 3(b). In fact, something of this form is independently necessary to prevent (15b) 
from overgenerating, as (16), a list of the unqualified hyponyms of support, indicates. 

(16) andiron, firedog, dog, dogiron, arch support, back, backrest, backboard, baluster, 
base, pedestal, stand, bearing, bearing wall, bedpost, bookend, brace, bracket, 
bridge, foot, foothold, footing, handrest, hanger, harness, harp, headstock, leg, 
perch, pier, pillow block, rack, stand, rest, rib, rocker, seat, shelf, skeg, sling, 
spoke, radius, step, stair, stirrup, stirrup iron, stock, gunstock, structural member,

13 We adopt the breadth-first WordNet search engine described in (?) to find the shortest connection 
or path between concepts. 














Figure 4: WordNet hierarchy for blanket 


tailstock, tee, football tee, undercarriage, yoke 

2.2 Examples of Non-Bleaching Verbs: Asphalt and Tarmac 

Asphalt and tarmac , along with butter in (2), are examples of non-bleaching denominals. 

(17) a. the crew asphalted/tarmaced the road with fresh asphalt/new tarmac 

b. # the crew asphalted/tarmaced the road with concrete 

c. # the crew asphalted/tarmaced the road with cobblestones 

Asphalt and tarmac both lie at the very bottom of WordNet’s hypernym/hyponym 
hierarchy. Hence the first part of the bleaching rule (15), repeated here as (18), admits 
no candidates.

(18) Denominal root Y may be bleached using X if 

a. X is a hyponym* of Y, or 

b. Z is a functional hypernym + of Y, and X is a hyponym + of Z 

Both nominals are instances of paving material, a functional superordinate with the
following set of hyponyms: 

(19) asphalt, concrete, cement, reinforced concrete, ferroconcrete, blacktop, blacktopping,
macadam, tarmacadam, tarmac, paving, pavement

Hence (18b) incorrectly rules in (17b). A potential fix for this is to hypothesize that
(18b) applies only when the set of X satisfying (18a) is non-empty. This amounts to
hypothesizing that leaf nodes are always non-bleachable:

(20) Denominals derived from leaf nodes are non-bleachable 
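The effect of the tentative restriction in (20) can be sketched as a guard on clause (18b). The sketch below is an illustration only: it flattens the hyponyms in (19) into direct children of paving material, it takes "non-empty" to mean that the root has at least one proper hyponym, and the names PAVING, FUNCTIONAL and bleach_candidates are hypothetical.

```python
# Hyponyms of "paving material", copied from (19) and flattened into
# direct children for simplicity (an assumption).
PAVING = ["asphalt", "concrete", "cement", "reinforced concrete",
          "ferroconcrete", "blacktop", "blacktopping", "macadam",
          "tarmacadam", "tarmac", "paving", "pavement"]
HYPERNYM = {name: "paving material" for name in PAVING}
FUNCTIONAL = {"paving material"}  # assumed functional superordinate

def hyponyms_plus(node):
    """Every concept strictly below node."""
    below, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for child, parent in HYPERNYM.items():
            if parent == current and child not in below:
                below.add(child)
                frontier.append(child)
    return below

def bleach_candidates(root, leaf_guard):
    proper = hyponyms_plus(root)
    candidates = {root} | proper                   # clause (18a)
    if leaf_guard and not proper:
        return candidates                          # (20): leaf, no bleaching
    node = root
    while HYPERNYM.get(node) in FUNCTIONAL:        # clause (18b)
        node = HYPERNYM[node]
        candidates |= hyponyms_plus(node)
    return candidates

without_guard = bleach_candidates("asphalt", leaf_guard=False)
with_guard = bleach_candidates("asphalt", leaf_guard=True)
```

Without the guard, concrete is admitted as a substitute for asphalt, which is the incorrect prediction for (17b); with the guard, asphalt admits only itself.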

Unfortunately, (20) is unmaintainable. Asphalt and tarmac belong to the class of 
Butter Verbs, see (7), with the template in (21), generalizing (2b). 

(21) x PUT <y> ON/IN z

where y represents the noun from which the verb is derived. 

Two other members of this class, blanket and blindfold, are also represented by leaf
nodes in WordNet. However, as was seen previously in (9) and (10), these are highly
bleachable verbs. How can we explain the bleachability of verbs like blanket?

The hierarchical structure relevant for blanket is given in Figure 4. As can be seen, 
WordNet distinguishes between artificial and natural coverings, covering_1 and covering_2








(respectively), a distinction not relevant in semantic bleaching. With respect to the 
bleaching rule shown previously in (18b), the superordinate nodes up to and including 
covering have functional value and those above do not. 14 

Hence, (18b) predicts that blanket is highly bleachable with any kind of covering, 
natural or otherwise, defined in the WordNet hierarchy. Many of these are listed in 
(22a) and (22b). 

(22) a. Natural coverings 

scale, shell, test, body covering, sheath, case, integument, blanket, mantle,
crust, incrustation, encrustation, envelope, shell, eggshell, slough, peridium,
pericarp, seed vessel, perianth, floral envelope, theca, sac, indusium, bark
b. Artificial coverings 

artificial skin, bootleg, canopy, casing, cloak, cloth, covering, clothing, clothes, 
apparel, vesture, wearing apparel, wear, coating, coat, cover plate, fig leaf, flap, 
floorcover, floor covering, folder, footwear, footgear, imbrication, overlapping, 
lapping, instep, mask, mercy seat, paddlebox, protective covering, protection,
screen, cover, covert, concealment, swathing, top, upholstery, wrapping, wrap,
wrapper 

Given these examples, the functional concept of covering is clearly well-motivated and 
pertinent to the bleaching of blanket. More precisely, the bleaching rule given by (18b)
predicts the derivation of (23b) from (23a). 

(23) a. X PUT <BLANKET> ON/OVER Z 
b. X PUT <COVERING> ON/OVER Z 

Nevertheless, the WordNet definition is a limited one. Almost anything can function 
as a blanket. As (24) illustrates, a Web search reveals a large variety of entities. 

(24) snow, fog, parachutes, sauce, smog, debris, ash, flowers, glaze, wildflowers, bacon,
forest, garland, mixed grill, turkey, ham, smoke, compost, clippings, mulch,
cheese, onions, plants, fallout, panels, bodies, pines, mixture, foliage, tephra, blast
material, craters, salsa, yogurt, shards, paper, scrub, cars, till, wilderness, loess,
crabmeat, fondue, logos, landmines, deposits, Teflon, bags, turf, notices, bracken,
heather, moss, mud, fronds, trees, groves, posters, handbills, doorknobs, powder,
haze, sand, absorbent, leaves, stars, crickets, peanuts, plaques, foul air, particles,
ice, rainforest, spruce, cedar, coating 15

(23b) is essentially correct. However, for the purpose of semantic bleaching, and
any other operation requiring the extension of the concept of covering, the node should
be augmented with a distinguished pointer to object, indicating the possibility of free
substitution. 16

14 Lending support to this is the fact that there is a switch in lexicographer’s file numbering. Blanket
through covering are classified either as <noun.artifact>, in the upper half of Figure 4, or
<noun.object>, in the lower half. Above covering, there is a change in classification: concepts artifact
through entity are labeled generically as <noun.Tops>.

15 We exclude from (24) metaphorical examples that were also reported including: love, details, 
color, white, lights, warm hearts, concern, enthusiasm, gray, memories, glory, protection, starry night, 
anonymity and responses. 

16 This is an oversimplification. Not all objects can function as a covering, e.g. # blanketed with air. 





warehouse => storehouse => depository => facility => artifact

Figure 5: WordNet hierarchy for warehouse 


2.3 Pocket Verbs: warehouse and spindle 

Locative denominals warehouse , as in (7d), and spindle , as in (25), are members of the 
class of Pocket Verbs , see (?), with the template in (26), generalizing (5a). 

(25) the dragon has been spindled on a spear 17 

(26) x PUT y ON/IN <z>

where z represents the noun from which the verb is derived. 

Consider first the relevant fragment of WordNet hierarchical structure for warehouse 
shown in Figure 5. Here, arguably the functionality of warehouse is lost for nodes higher 
than depository. Again, the WordNet concept of a depository , given in (27), is a limited 
one, as almost any structure can be turned into a warehouse. 

(27) archive, archives, chancery, bank, bank building, drop, maildrop, postbox, mailbox,
letter box, pillar box, library, depository library, athenaeum, atheneum, lending
library, circulating library, museum, Louvre, Louvre Museum, science museum,
repertory, sperm bank, storehouse, depot, entrepot, storage, store, granary, garner,
magazine, powder store, powder magazine, railhead, treasure house, warehouse,
storage warehouse, godown, treasury

For completeness, (28) shows the result of a Web search on the bleaching of warehouse . 18 

(28) housing, bonding warehouse, storage, building, hanger, apartment, silo, shed, bin,
room, cubicles, basement, institution, jail, prison, facility, nursing home, repository,
shelter, brothel, pediatric ward, universities, research laboratories, barracks,
hostel, retail outlets, state hospitals, orphanages, stall, archives, libraries, museums,
stores, factory, distribution centre, asylums, government schools, insecure
wing, winter quarters, housing projects, hospital hallways, sanatoria, tenement
hotel, study halls, boarding house, sanctuary, classroom

In Figure 6(a), however, the noun spindle represents a specialized concept and thus is
not subject to extensive bleaching. Applying bleaching rule (18) results in the following
(simplified) lists, assuming that rod and stick are the uppermost functional nodes.
Our rule predicts limited bleaching, producing robust and reasonable (though
incomplete) candidates, confirmed by the data in (29b).

(29) a. Bleaching rule: baton, wand, rod, pole, boom, caber, mast, spar, stilt, 

ramrod, shaft, spindle, mandrel, arbor, axle, journal, thill, bow, club, staff 

b. Web data: rod, shaft, spear, incisor 

17 This example is taken from: Enter the Rambo Warrior, who shouts, “Yo, Dragon!” and before you
know it, the dragon has been spindled on a spear and is lying dead at the Warrior's feet.

18 No attempt has been made in (28) to separate nouns for structures such as shed and silo from general
locational labels such as sanctuary and jail. There are also many cases of metaphorical use, including:
police files, databases, indexed flat file structures, relational tables, liquid nitrogen, portfolio, data
repositories, mainframe, state foster care system, unconscious mind and bilingual education classes.









(a) (b) 


Figure 6: WordNet hierarchy for spindle and spear 

Finally, note that Figure 6(b) indicates that spear can only be reached in WordNet
via the generic concepts of implement and instrumentality. WordNet distinguishes two
senses of spear in terms of its functionality. The corresponding glosses for spear_1 and
spear_2 are given in (30a) and (30b).

(30) a. an implement with a shaft and a barbed point used for catching fish 
b. a long pointed rod used as a weapon

Although the definitions include the terms shaft and rod as components of a spear, both
terms being candidates returned by the bleaching rule, there is no direct functional
semantic relation here. Obviously, a sharpened shaft or rod can function as a spear. The
notion of an indirect functional relation remains to be defined in future work.


3 Conclusions 

In this paper, we have investigated how WordNet can be used to help formalize the 
notion of semantic bleaching as it applies to denominal verbs. We have defined, and 
refined over a series of examples incorporating varying degrees of bleaching, a bleaching 
rule formally defined over the WordNet hypernym/hyponym hierarchy. We have argued 
for a notion of functionality relevant to bleaching and, in particular, the need to tease out 
or distinguish functional isa-relations within the noun hierarchy. 

Address: NEC Research Institute, 4 Independence Way, Princeton NJ, USA. 

Email: sandiway@research.nj.nec.com 


References 

S. Fong. 2001. On mending a torn dress: The frame problem and wordnet. In Proceedings of 
the 2001 NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions 
and Customizations , Carnegie Mellon University, Pittsburgh. 

A. Gangemi, N. Guarino, and A. Oltramari. 2001. Conceptual analysis of lexical taxonomies: 
The case of wordnet top-level. In Proceedings of FOIS 2001. 

P. Kiparsky. 1997. Remarks on denominal verbs. In A. Alsina, J. Bresnan, and P. Sells, editors, 
Complex Predicates , pages 473-499. CSLI Publications. 

B. Levin. 1993. English verb classes and alternations: A preliminary investigation. University 
of Chicago Press. 

M. Rappaport Hovav and B. Levin. 1998. Building verb meanings. In Butt and Geuder, editors, 
The Projection of Arguments: Lexical and Compositional Factors. CSLI Lecture Notes No. 83. 












Cross-Linguistic Discovery of Semantic Regularity 


Wim Peters 
Louise Guthrie 
Yorick Wilks 


Abstract 

The question of whether metonymy carries across languages has always been interesting for language 
representation and processing. Until now attempts to answer this question have always been based on 
small-scale analyses. With the advent of EuroWordNet (Vossen 1998), a multilingual thesaurus covering 
eight languages and organized along the same lines as WordNet (http://www.cogsci.princeton.edu/~wn/) 
we have a unique opportunity to research this question on a large scale. In this paper we systematically 
explore sets of concepts comprising possible metonymic relations that have been identified in WordNet. 
The sets of concepts are evaluated, and a contrastive analysis of their lexicalization patterns in English, 
Dutch and Spanish is performed. Our investigation gives insight into the cross-linguistic nature of 
metonymic polysemy and defines a methodology for dynamic extensions of semantic resources. 

1. Introduction 

Viewed traditionally, metonymy is a non-literal figure of speech in which the name of one thing is 
substituted for that of another related to it. It has been described as a cognitive process in which one 
conceptual entity, the vehicle, provides mental access to another conceptual entity (Radden 1999). In its 
basic form, it establishes a semantic relation between two concepts that are associated with word forms. 
The semantic shift expressed by the relation may or may not be accompanied by a shift in form. The 
semantic relation that is captured by metonymy is one of semantic contiguity, in the sense that in many 
cases there are systematic relations between metonymically related concepts that can be regarded as slots 
in conceptual frames (cf. Fillmore 1977). 

For example, in the sentence ‘The colonies revolted against the crown.’ crown is used as a symbol for the 
monarchy as well as denoting the traditional head ornament worn by the monarch. As the example above 
shows, polysemy is a common way in which metonymically related concepts manifest themselves in 
language. 

Any systematic semantic relations between concepts expressed by these sense distinctions are lexicalized,
i.e. they are explicitly listed in dictionaries and independent of a pragmatic situation. For example, The
White House is on the one hand an institution and on the other a building. The semantic relation between
the two senses is ‘is housed in’.

Regular polysemy is a subset of metonymy that covers the systematicity of the semantic relations 
involved. It can be defined as a subset of metonymically related senses of the same word displaying a
conventional as opposed to novel type of semantic contiguity relation. This relation holds for related 
senses of two or more words (Apresjan, 1973), i.e. is a lexicalized pattern, not a nonce formation (a 
pragmatically defined novel metonymy), and can therefore be called regular. It is this subtype of 
metonymy that we concentrate on in this paper. 

2. Regular Polysemy across Languages 

The question whether regular polysemy is a cross-linguistic phenomenon has until now only been
approached by small scale analyses.



Kamei and Wakao (Kamei, 1992) approached the question from the perspective of machine translation 
and conducted a comparative survey of the acceptability of metonymic expressions in English, Chinese 
and Japanese consisting of 25 test sentences. The results they report show that in some cases English and 
Japanese share metonymic patterns to the exclusion of Chinese, but that in others English and Chinese
team up. 

Seto (1996) performed a study into the lexicalization of the container-content schema in various languages
(Japanese, Korean, Mongolian, Javanese, Turkish, Italian, Germanic and English). This pattern is 
lexicalized in English by ‘kettle’: 

1. A metal pot for stewing or boiling; usually with a lid 

2. The quantity a kettle will hold 

His observation was that the pattern is observable in all languages, and can be considered cross-linguistic. 
This small study seems to indicate that the regular polysemic pattern extends over language family 
boundaries to such an extent that it almost seems universal. This could suggest that the pattern is rooted in 
general human conceptualisation, and reflects an important non-arbitrary semantic relation between 
concepts or objects in the world. Indeed, if we describe the relation between container and content in 
terms of Aristotle’s qualia structure (Pustejovsky 1995), we see that it is the function of a container to 
hold an object or substance (telic role) and that a container is normally brought into existence for this 
purpose. 

More small-scale studies like the ones described above have been performed, mostly relying on 
introspection and small-scale dictionary analysis. A limited number of patterns that are valid in more than 
one language has been identified such as container/content and producer/product (Peters 2000). With the 
availability of WordNet and EuroWordNet it has become possible to investigate the cross-linguistic 
nature of metonymy on a large scale. 

3. EuroWordNet 

EuroWordNet (EWN) (Vossen 1997; Peters 1998) is a multilingual thesaurus incorporating wordnets
from eight languages: English, Italian, Dutch, German, Spanish, French, Czech, Estonian. The wordnets 
have been built in various ways. Some of them have been created on the basis of language specific 
resources and matched onto the original Princeton WordNet (Fellbaum 1998) when the interlingual 
relations were created. They therefore reflect the language specific lexicalization patterns and semantic 
organization. Others have been built from the start on the basis of a match between WordNet and 
bilingual dictionaries. In this case the conceptual structure is less language specific but can be regarded as 
the conceptual overlap between the structure of the English WordNet and the ontological structure 
associated with that particular language. 

EuroWordNet gives us for the first time the opportunity to examine the question of the language 
independence of regular polysemy in a more systematic and automatic way. 

4. Methodology 

The following methodology has been followed: 

First, the hierarchy of WordNet 1.6 was analysed in order to obtain English candidates for regular
polysemic patterns (section 4.1). Then a process we call lexical triangulation was applied to these data 
within EuroWordNet (section 4.2). The results were then manually evaluated. 





4.1 Automatic Candidate Selection 


A technique was developed (Peters 2000) for identifying sense combinations in WordNet where the 
senses involved potentially display a regular polysemic relation, i.e. where the senses involved are
candidates for systematic relatedness. 

In order to obtain these candidate patterns WordNet (WN) has been automatically analysed by exploiting 
its hierarchical structure. Wherever there are two or more words with senses in one part of the hierarchy, 
which also have senses in another part of the hierarchy, then we have a candidate pattern of regular 
polysemy. The patterns are candidates because there seems to be an observed regularity for two or more 
words. This follows the definition of (Apresjan 1973) mentioned in the introduction. 

An example can be found in Figure 1 below. 


Hypernym combination:

fabric
(something made by weaving or felting or knitting or crocheting natural or
synthetic fibers)

covering
(a natural object that covers or envelops)

Words whose senses occur under both hypernyms:

fleece, hair, tapa, wool

Figure 1: Words in WordNet covered by the pattern fabric/covering 

We have restricted our experiments to cases where the related meanings are of the same syntactic class 
(nouns). The procedure does not discover all regular polysemy relations, because the outcome is heavily 
dependent on the consistency of the encoding of these regularities in WordNet. 
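The candidate selection step can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than the procedure of Peters (2000): each word is mapped directly to the set of hypernyms under which it has at least one sense, the hierarchy itself is not traversed, and only the fabric/covering entries come from Figure 1 (the kettle entry and its hypernym labels are invented for contrast).

```python
from collections import defaultdict
from itertools import combinations

# word -> hypernyms under which the word has at least one sense.
# Only the fabric/covering rows are taken from Figure 1; "kettle" and its
# labels are hypothetical and serve as a non-qualifying contrast case.
SENSE_HYPERNYMS = {
    "fleece": {"fabric", "covering"},
    "hair": {"fabric", "covering"},
    "tapa": {"fabric", "covering"},
    "wool": {"fabric", "covering"},
    "kettle": {"container", "quantity"},
}

def candidate_patterns(sense_hypernyms, min_words=2):
    """Return hypernym pairs shared by at least min_words words."""
    words_by_pair = defaultdict(set)
    for word, hypernyms in sense_hypernyms.items():
        for pair in combinations(sorted(hypernyms), 2):
            words_by_pair[pair].add(word)
    return {pair: sorted(words)
            for pair, words in words_by_pair.items()
            if len(words) >= min_words}

patterns = candidate_patterns(SENSE_HYPERNYMS)
```

Here the pair (covering, fabric) qualifies because four words share it, while the hypothetical container/quantity pair is dropped because only one word in the toy data exhibits it. Raising min_words to 3 corresponds to a stricter version of the Apresjan criterion.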

4.2 Lexical Triangulation 

In order to determine whether regular polysemy is indeed a cross-linguistic phenomenon, one needs to 
compare languages, preferably from different language families. 

Data will depend heavily on vocabulary coverage in various languages, and until the advent of 
EuroWordNet no serious lexical data sets were available for analysis. The EuroWordNet database is the 
most comprehensive multilingual thesaurus to date. This resource not only provides us with an 
appropriate amount of lexical information in terms of vocabulary coverage, but also has the additional 
advantages that its taxonomic building blocks are identical for all languages involved and the language 
specific concepts are all linked to an interlingua which is based on the full set of the original Princeton 
WordNet (version 1.5), and is referred to as the interlingual index (ILI). 

We started with a comparative analysis of Germanic and Romance languages. The main reason for this is 
that the size of the corresponding wordnets is large enough to yield significant results. For our analysis 
we used three languages: English, Dutch and Spanish, hence the term for this process: lexical 
triangulation. 

Singling out areas where three language-specific lexicalization patterns converge enabled us to identify
metonymic patterns that supported the hypothesis that certain relationships between senses are inherent.

We extracted the sense combinations of Spanish and Dutch words that participate in any of the potential 
regular polysemic patterns from the initial large set described in section 4.1. In other words, we 
concentrate here on lexicalization patterns in three different languages: sense combinations that are 
lexicalized by one language-specific word in English, Spanish and Dutch. 

The first step in this process was the reduction of the search space for regular polysemic patterns in 
EuroWordNet. First we determined the conceptual overlap for nouns between the English, Dutch and 
Spanish wordnets. Table 1 below shows the number of nouns in the three wordnets involved.


Language    Number of noun synsets    Number of corresponding ILI concepts

English     66025                     66025
Dutch       28352                     26779
Spanish     24073                     24087


Table 1: Conceptual coverage of English, Dutch and Spanish wordnets 


The conceptual overlap between these wordnets is computed simply by determining the 
intersection of ILI noun concepts covered by each of the wordnets. The total overlap is 17007 
ILI concepts. 
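
The overlap computation can be sketched as a plain set intersection. This is a minimal illustration, assuming each wordnet has been reduced to the set of ILI identifiers that its noun synsets link to. The identifiers and the function name conceptual_overlap are hypothetical toy values, not EuroWordNet data.

```python
def conceptual_overlap(ili_sets):
    """Return the ILI concepts covered by every wordnet in ili_sets."""
    sets = [set(s) for s in ili_sets]
    if not sets:
        return set()
    return set.intersection(*sets)


# Toy ILI identifiers (hypothetical, for illustration only).
english = {"ili-1", "ili-2", "ili-3", "ili-4"}
dutch = {"ili-2", "ili-3", "ili-5"}
spanish = {"ili-2", "ili-3", "ili-6"}

overlap = conceptual_overlap([english, dutch, spanish])
print(sorted(overlap))  # ['ili-2', 'ili-3']
```

Applied to the real wordnets, the size of the resulting set would correspond to the total overlap reported in the text.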

There are 920 polysemous English nouns with two or more senses in synsets linked to this set of ILI
concepts. Their sense combinations are also lexicalized by a single word in Spanish and in Dutch. For
example, the English word church has one sense that is a building and another that is an institution. The 
same sense distinctions apply to the Spanish iglesia and the Dutch kerk. The senses in the different 
wordnets are linked through the ILI concepts by means of equivalence synonymy or near-synonymy 
relations (Vossen 1997). 
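
The triangulation step can be sketched as follows. This is a minimal illustration, assuming each wordnet is available as a mapping from a noun to the set of ILI concepts its senses are linked to. The function names and ILI labels are hypothetical, and the toy entries are chosen to follow the church, iglesia and kerk example above.

```python
from itertools import combinations


def sense_pairs(wordnet):
    """Map each unordered pair of ILI concepts to the words lexicalizing both."""
    pairs = {}
    for word, concepts in wordnet.items():
        for pair in combinations(sorted(concepts), 2):
            pairs.setdefault(pair, set()).add(word)
    return pairs


def triangulate(english, dutch, spanish):
    """Return the sense pairs lexicalized by a single word in all three languages."""
    en, nl, es = sense_pairs(english), sense_pairs(dutch), sense_pairs(spanish)
    shared = set(en) & set(nl) & set(es)
    return {pair: (en[pair], nl[pair], es[pair]) for pair in shared}


# Toy data (hypothetical ILI labels, for illustration only).
english = {
    "church": {"ili-building", "ili-institution"},
    "office": {"ili-workplace", "ili-job"},
}
dutch = {
    "kerk": {"ili-building", "ili-institution"},
    "kantoor": {"ili-workplace"},
}
spanish = {"iglesia": {"ili-building", "ili-institution"}}

result = triangulate(english, dutch, spanish)
for pair, words in result.items():
    print(pair, words)
```

In this toy data only the building and institution pair survives, because office has no single Dutch or Spanish word covering both of its senses.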

The second step was to map these noun senses onto the results from the wordnet analysis described in
section 4.1, and then to evaluate the cross-linguistic validity of the regular polysemic patterns that have
been projected from the English monolingual wordnet onto the Dutch and Spanish wordnets.

5. Evaluation 

The cross-linguistic filter yields a subset of the monolingual analysis data described in section 4.1. It 
covers 404 distinct English nouns out of a total of 8062 (5%). 

The original filter considered nouns satisfying the criterion of Apresjan (cf. section 1), i.e. each noun is
one of at least two words whose sense distinctions exhibit a particular relationship.

The percentage covered by the cross-linguistic data relative to the original analysis ranges from 100%
for the very small potential classes of regular polysemy (2-3 words) to 1-2% for middle-sized (30-50
words) and large classes (100+ words).

In order to create a set for manual evaluation, the set of 404 English nouns was reduced by strengthening
the Apresjan criterion and requiring that a word be considered only if it was one of a set of at least three
words illustrating the regular polysemy (RP). We will refer to this as a three-word RP class. The rationale
behind this was that two-word candidate RP classes introduce noise because of the increased probability
of a fortuitous coincidence of senses belonging to a set of just two words. This step reduced the number of
participating words to 394. At this point, 177 words were randomly chosen from this set for manual
evaluation.
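
The strengthened criterion amounts to a threshold on class size. The sketch below is a minimal illustration, assuming candidate RP classes are stored as a mapping from a hypernym pair to the set of words exhibiting it. The function names, the two-word class and its labels are hypothetical.

```python
def filter_rp_classes(rp_classes, min_words=3):
    """Keep candidate RP classes that have at least min_words member words."""
    return {
        pair: words
        for pair, words in rp_classes.items()
        if len(words) >= min_words
    }


def participating_words(rp_classes):
    """Collect every word that belongs to at least one retained class."""
    return set().union(*rp_classes.values())


# Toy candidate classes (illustration only; the second class is hypothetical).
candidates = {
    ("occupation", "discipline"): {"architecture", "literature", "law"},
    ("hypernym-a", "hypernym-b"): {"word-x", "word-y"},
}

kept = filter_rp_classes(candidates, min_words=3)
print(sorted(participating_words(kept)))
```

With min_words=2 the function reproduces the original Apresjan criterion, and with min_words=3 it drops the two-word class as possible noise.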

The evaluation consisted of examining the hypernym pairs that reflect a candidate regular polysemic
relation. The criteria used in this step are semantic homogeneity (the semantic relation that defines the
candidate RP class should apply to the majority of the participating words) and specificity of the pattern
(the lower the position of the hypernymic pair in the hierarchy, the more specific the semantic relation).
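
The candidate hypernym pairs for two senses of a word can be enumerated as the Cartesian product of their hypernym chains. The sketch below is a minimal illustration, assuming a single-inheritance, cycle-free hierarchy stored as a child-to-parent mapping. The function names and the toy hierarchy are hypothetical and are not taken from WordNet.

```python
from itertools import product


def hypernym_chain(sense, hypernym_of):
    """Return the hypernyms of sense, most specific first."""
    chain = []
    current = hypernym_of.get(sense)
    while current is not None:
        chain.append(current)
        current = hypernym_of.get(current)
    return chain


def candidate_pairs(sense_a, sense_b, hypernym_of):
    """Enumerate all hypernym pairs that could define an RP class for two senses."""
    return list(
        product(
            hypernym_chain(sense_a, hypernym_of),
            hypernym_chain(sense_b, hypernym_of),
        )
    )


# Toy hierarchy (hypothetical, for illustration only).
hypernym_of = {
    "banana (plant)": "herb",
    "herb": "plant",
    "banana (fruit)": "edible fruit",
    "edible fruit": "produce",
}

pairs = candidate_pairs("banana (plant)", "banana (fruit)", hypernym_of)
for pair in pairs:
    print(pair)
```

Because each chain is ordered from specific to general, pairs near the start of the list sit lower in the hierarchy and so correspond to more specific semantic relations.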

Of these words, 109 (62%) displayed valid regular polysemic patterns and 68 (38%) did not.

This automatic filtering method therefore has a 62% success rate for identifying valid regular polysemic
patterns. Below are a few examples of cross-linguistic RP classes that have satisfied the criteria of the
evaluation.


Hypernymic Pair: Control (the activity of managing or exerting control over something) - Trait (a
distinguishing feature of one's personal nature)

English RP class (7 total): abstinence, sobriety, inhibition, restraint, self-control, self-denial,
self-discipline

Dutch RP class (2 total): zelfcontrole, onthouding

Spanish RP class (3 total): autodisciplina, abstinencia, abnegación, inhibición

Coverage of the intersection between all three languages: 36% of set derived from WordNet 

Hypernymic Pair: Fabric (something made by weaving or felting or knitting or crocheting natural or
synthetic fibers) - Covering (a natural object that covers or envelops)

English RP class (4 total): wool, hair, fleece, tapa
Dutch RP class (1 total): wol 
Spanish RP class (1 total): lana 

Coverage of the intersection between all three languages: 25% of set derived from WordNet 

Hypernymic Pair: Plant (a living organism lacking the power of locomotion) - Edible fruit (edible
reproductive body of a seed plant especially one having sweet flesh)

English RP class (159 total): apple, boxberry, blackcurrant, banana, fig ...

Dutch RP class (9 total): banaan, vijg, persimoen, meloen...

Spanish RP class (20 total): banana, plátano, melón, caqui, higo...

Coverage of the intersection between all three languages: 2.5% of set derived from WordNet 

Hypernymic Pair: Person (a human being) - Quality (an essential and distinguishing attribute of
something or someone)

English RP class (11 total): attraction, authority, beauty, ...

Dutch RP class (1 total): schoonheid

Spanish RP class (4 total): belleza, atracción, autoridad, imagen

Word intersection between all three languages: 9% of set derived from WordNet


Note: A complication arises because many combinations of hypernym pairs can be considered for the same set of words.
(In fact the possibilities are the Cartesian product of the ancestors of each of the hypernyms in the pair.) If all
hypernymic combinations were taken into account, this would amount to an average of 17 classes per word.





Hypernymic Pair: Substance (that which has mass and occupies space) - Drug (something that is used 
as a medicine or narcotic) 

English RP class (25 total): alcohol, bromide, dragee, histamine, iodine, liquor ... 

Dutch RP class (2 total): broom, cocktail 

Spanish RP class (10 total): bromuro, histamina, muscatel, yodo... 

Word intersection between all three languages: 4% of set derived from WordNet 

Hypernymic Pair: Occupation (the principal activity in your life) - Discipline (a branch of knowledge)

English RP class (6 total): architecture, literature, politics, law, theology, interior design

Dutch RP class (1 total): architectuur

Spanish RP class (2 total): arquitectura, teología

Word intersection between all three languages: 16% of set derived from WordNet

6. Universality of Regular Polysemy 

It is possible to view these results as an indication of the cross-linguistic validity of the regular polysemic 
patterns and their level of universality relative to the language families represented by the wordnets. The 
hypothesis is that if a metonymic pattern occurs in several languages, there is stronger evidence for a 
higher level of universality of the regular polysemic pattern. 

Of course, these results are affected by the coverage of the wordnets in EuroWordNet. Since the Dutch and
Spanish wordnets are only half the size of the English wordnet, only limited coverage can be expected.
Still, the coverage seems to be consistently low in most cases, often not more than 2-5%. On the basis of
wordnet size alone, one would expect higher coverage.

There are other explanations for the lack of identical lexicalizations in other target language wordnets: 

1. The metonymic pattern is language specific, and is not realised as a polysemous word in the target
language. For example, the Dutch kantoor is synonymous with the English office in the sense ‘where
professional or clerical duties are performed’, but its sense distinctions cannot mirror the regular
polysemic relation in English with ‘a job in an organization or hierarchy’.

2. The pattern is unattested in the target language in terms of usage but forms a potential sense extension 
in that language. For instance, the Spanish iglesia and the Dutch kerk both mean ‘building for worship’ 
and ‘a service conducted in a church’. The Spanish wordnet has an additional systematically related sense 
for iglesia (‘institution to express belief in a divine power’) that is not shared by its Dutch counterpart but 
is a valid new sense. 

3. The missing sense can in fact only be lexicalized by another word or compound or derivation related
to the word with the potentially missing sense. For example, the Dutch vereniging has the sense (an
association of people with similar interests). The English equivalent is club, for which there is another
sense in WordNet (a building occupied by a club). This is not a felicitous sense extension for the Dutch
vereniging, because the favoured lexicalization is the compound verenigingshuis (club house).

4. The metonymic pattern is in fact attested in the language, but one or more of the senses participating in
the pattern have not yet been captured in the wordnet. One of the reasons could be the sense granularity of
the resource on the basis of which the wordnet has been built. For example, embassy has one sense in
WordNet (a building where ambassadors live or work). The Dutch translational equivalent ambassade has
an additional sense denoting the people representing their country. This sense can be projected to the
English WordNet as a regular polysemy pattern that is also valid in English. In fact, LDOCE
(Procter, 1978) only lists the sense which is missing in WordNet.





7. Coverage and Extendibility 


There are many RP classes whose English word members do not all have a Dutch or Spanish counterpart.
We wanted to evaluate the universality of the regular polysemic relations by testing native speaker
intuitions about these regular polysemic gaps. This was done by projecting the senses of the participating
English words in an RP class onto Dutch and Spanish, and assessing whether the missing senses were
adequate additional senses in these two languages.
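
Finding the gaps to present to native speakers can be sketched as a set difference per word. This is a minimal illustration, assuming a one-to-one translation mapping and a mapping from each target-language word to the ILI concepts it is linked to. The function name, the ILI labels and the data layout are hypothetical.

```python
def find_gaps(rp_class, translations, target_senses):
    """List the RP senses that each target-language equivalent is missing.

    rp_class: English word -> set of ILI concepts forming the RP pattern
    translations: English word -> target-language word (hypothetical mapping)
    target_senses: target-language word -> set of ILI concepts it is linked to
    """
    gaps = {}
    for word, senses in rp_class.items():
        target = translations.get(word)
        if target is None:
            continue  # no counterpart in the target wordnet at all
        missing = senses - target_senses.get(target, set())
        if missing:
            gaps[target] = missing
    return gaps


# Toy data (hypothetical ILI labels, for illustration only).
rp_class = {
    "literature": {"ili-writing-profession", "ili-literary-study"},
    "architecture": {"ili-arch-profession", "ili-arch-discipline"},
}
translations = {"literature": "letterkunde", "architecture": "architectuur"}
dutch_senses = {
    "letterkunde": {"ili-literary-study"},
    "architectuur": {"ili-arch-profession", "ili-arch-discipline"},
}

print(find_gaps(rp_class, translations, dutch_senses))
```

Each missing sense returned this way is a candidate sense extension whose felicity still has to be judged by native speakers; the sketch only locates the gaps.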

The experiment we conducted was very small. We intend to perform more experiments of this kind in the
future. The pattern we examined is the hypernymic combination occupation (the principal activity in
your life) - discipline (a branch of knowledge). This RP class has five members. Two Dutch and two
Spanish native speakers were asked to judge the felicitousness of the senses that are missing in the Dutch
and Spanish wordnets. Below is a short discussion of each member.

interior design 

1. the trade of planning the layout and furnishings of an architectural interior 

2. the branch of architecture dealing with the selection and organization of furnishings for an 
architectural interior 

The corresponding Dutch word binnenhuisarchitectuur has only one sense which is linked to both 
WordNet senses by means of a near-synonymy relation. This means that the Dutch wordnet is 
underspecified for the distinction of these metonymically related senses and can be extended with the 
specific sense distinctions (see explanation 4 above). This coincided with the verdict of the Dutch jury. 
The Spanish wordnet has a separate translation for each sense: interiorismo (corresponding to interior
design 1) and diseño de interiores (corresponding to interior design 2). The latter translational equivalent
was considered to also have a possible trade reading.

law 

1. the learned profession that is mastered by graduate study in a law school and that is
responsible for the judicial system

4. the branch of philosophy concerned with the law 

The Dutch word rechtswetenschap also has only one sense which is linked to both WordNet senses by 
means of a near-synonymy relation. This again means that the Dutch wordnet is underspecified for the 
distinction of these metonymically related senses and can be extended with the specific sense distinctions 
(see explanation 4 above). This coincided with the verdict of the Dutch jury. 

The Spanish equivalent of law 4 is jurisprudencia, whereas law 1 does not have a correspondence in the
Spanish wordnet. The profession reading was not considered a felicitous additional sense for this word.
Both subjects remarked that another word captures both meanings: leyes, which is not present in the
Spanish wordnet.

literature 

1. the profession or art of a writer 

2. the humanistic study of a body of literature 

The Dutch letterkunde is linked only to sense 2 of literature. Sense 1 was not considered to be a
straightforward new sense for this word by the judges.

The Spanish literatura lacks a profession reading in the Spanish wordnet. This sense was considered
valid by one subject, but rejected by the other.





Collaborators for the Conference 

Global WordNet Association, The Netherlands 
Indian Institute of Technology, Bombay 
Indian Institute of Information Technology, Hyderabad 
Central Institute of Indian Languages, Mysore 



